PREFACE


This book affords a general introduction to statistical method. It
is intended primarily for those who are making their first acquaint- 
ance with the subject. The requirements of classroom use have 
been kept constantly in mind, and every effort has been made to make 
the material teachable. At the same time, the treatment of the sub- 
ject is comprehensive and designed to start the intelligent student on 
the road to a complete grasp of general statistical method. 

The emphasis of the book is upon the analysis, rather than upon the 
collection and tabulation, of statistical material. It is the author’s 
opinion that the great majority of students of statistics are not directly 
interested in the technique of collection and tabulation. They are con- 
cerned with these phases of statistical method only in so far as they 
affect the subsequent analysis of data. Furthermore, the more im- 
portant aspects of collection and tabulation can be fully understood 
only after the nature and purposes of analysis have been mastered.
(True, some acquaintance with the technicalities of collection and tabu- 
lation is desirable at an early stage of the study of analysis; conse- 
quently brief accounts of these are given in a series of appendices.) In 
general, therefore, attention may be profitably focused at the outset 
on the problems of analysis, and this is the plan of the present volume. 

Special care has been taken throughout the book to make clear the 
logical relationships of the parts. This results in rather more formal- 
ism than would be desirable in any treatment other than that of an 
introductory text. But it is highly important at the beginning of the 
study of statistics to learn to place correctly in the general scheme of 
statistical method the processes that are to be undertaken in any par- 
ticular analysis. The student should be made to see how the individ- 
ual parts are related to the whole. 

In more extended form, the title of this book might be given as “The 
Logic of Statistical Analysis.” Several years of experience in teaching
statistics have led the author to believe that it is much more important
for the beginning student to learn the logical setting of statistical
analysis than it is for him to gain facility in any set of technical pro- 
cesses. Some command of technical procedure, of course, is necessary. 
But, after all, the objectives of statistical work lie in the interpretation 
of statistical results; and interpretation can never be wisely under- 
taken save with full recognition of the logical implications and limita- 
tions of the processes that have gone before. At the risk, perhaps, of 
being somewhat meticulous, if not pedantic, the present treatment 
has been written with the avowed intention of making perfectly clear 
the logic of statistical procedure. 

In view of this underlying purpose, it has not been thought neces- 
sary to enter upon the more refined mathematical phases of statistical 
method. Nothing in the book should prove inconsistent with the find- 
ings of the most advanced mathematical statistics. But no command 
of advanced mathematics is expected of those for whom the book has 
been written. For those who have special interests or preparation 
along mathematical lines, supplementary studies can be readily undertaken.
Furthermore, it is to be hoped that some of those who secure
through this book their introduction to statistics will be encouraged to 
pursue the subject into its fascinating and important mathematical 
ramifications. A serviceable start in the understanding of statistical 
analysis may be made, however, with a modicum of purely mathemati- 
cal procedure, and in the present exposition of the subject the mathe- 
matical requirements have been reduced to a minimum. 

Attention should be called to one special feature of the book: the 
treatment of methods of graphic representation. It is customary in 
texts on statistical method to devote special chapters to statistical 
graphics. From a pedagogical point of view, this plan has certain 
advantages. It involves, however, the serious danger of divorcing the 
practices of statistical graphics from the processes of statistical analy- 
sis. Most of the abuses of statistical graphics have occurred because 
diagrams have not been made to exhibit accurately the results of analy- 
sis. A closer inter-relation of analytical and graphic practices is de- 
sirable. In the present text, therefore, questions of graphic method 
are dealt with wherever they arise in the explanation of the manifold 
phases of analysis. 

It is quite impossible for the author to state explicitly the extent of 
his indebtedness to others. With other workers in the field, he has 
profited, in part perhaps unconsciously, from the labors of many scholars.
However, some of those whose contributions have been found especially
valuable should be specifically mentioned. All readers who are at all 
familiar with the literature of the subject will quickly sense the author’s 
heavy indebtedness to such eminent authorities as Bowley, Fisher, 
Mitchell, Pearson, Persons, and Yule. Numerous footnotes through- 
out the book indicate sources which have been explicitly utilized at 
particular points. The author’s colleagues, Professors O. W. Blackett, 
J. P. Mitchell, and C. S. Yoakum, have examined the manuscript and 
offered numerous valuable comments. Former students, among
whom perhaps should be particularly mentioned Professors W. A.
Berridge and M. B. Hexter, have made many helpful suggestions.
Finally, the author’s secretary, Miss Carolyn E. Allen, has given in- 
valuable and tireless assistance without which the book could hardly 
have been published at this time, if at all. 
EDMUND E. DAY

ANN ARBOR, MICHIGAN
September 20, 1925


STATISTICAL ANALYSIS

THE SIGNIFICANCE OF STATISTICS


A natural introduction to statistics is to be found in a simple
working definition of the subject. This is true despite the fact 
that the nature of statistics can be fully understood only after 
acquaintance with the essentials of statistical analysis. No 
definition presented at the very beginning of study of the subject 
can be scientifically adequate and at the same time thoroughly 
intelligible. But a preliminary definition is entirely practicable. 
Such a definition, though it has to be later revised or amplified, is 
bound to prove helpful in making a first approach to the problems 
with which statistics has to deal. 

The task of formulating even a preliminary definition is un- 
fortunately complicated by the fact that conceptions of the nature 
of statistics have differed widely. Originally, statistics meant a 
body of information descriptive of the state. Subsequently the 
word was applied to such parts of this information as were in 
numerical form. More recently, statistics has come to refer to the 
methods employed in analyzing quantitative data from whatever 
field drawn. There appears to be a growing tendency to think of 
statistics in this latter sense — that is, as a phase of general 
scientific method. But substantial divergences remain. There 
is as yet no universally recognized definition of the subject. 

In the present work statistics is held to be a method of dealing 
with quantitative data. This conception considerably simplifies 
the problem of definition, but does not altogether dispose of the
matter. Considerable differences exist among those who agree
that statistics is a phase of methodology. Thus, statistics has
been described as the “science of counting,” the “science of large
numbers,” and the “science of averages.” Although all of these
definitions point to significant aspects of statistical inquiry, they 
are all too highly abbreviated to be adequate. A more helpful 
conception of statistical method is given by G. U. Yule: “meth- 
ods especially adapted to the elucidation of quantitative data 
affected by a multiplicity of causes.”1 From many points of
view, this definition is entirely acceptable. It becomes increas- 
ingly so as the more important aspects of statistical analysis are 
thoroughly understood. For the present, however, a simpler 
definition may be made to answer: statistics may be said to consist
of those methods of investigation peculiar to the collection, tabulation,
and analysis of quantitative data.2

Thus defined, statistics obviously is applicable to any field 
of inquiry in which quantitative data are encountered. Statis- 
tical methods are fundamentally the same whether employed in 
the analysis of astronomical observations, the study of anthropo- 
metric measurements, the interpretation of the records of bio- 
logical experiment, the examination of educational data, the 
investigation of records assembled by sociological agencies, or 
the collection and analysis of quantitative material in some 
branch of economic science or business administration. True,
the emphasis of statistical method shifts as it is applied in one
field or another; but at bottom the method is the same.3

1 Introduction to the Theory of Statistics, p. 5.

2 This definition does not preclude the use of the word statistics in the sense of the
quantitative data themselves. Thus, reference may be made to the statistics of foreign trade or
of business failures. There is no objection to the concurrent use of the word statistics in the
sense of method and of material. Practically no confusion arises since in one sense the
word is singular, and in the other, plural. When it is desirable to distinguish between the
two, the phrases “statistical method” and “statistical data” may be employed.

3 A clear understanding of the nature of statistics may be had from its application in
any one of several fields. It is advantageous, however, with a view to effective instruction,
to study statistics in some field in which the materials analyzed, as well as the methods
employed, are of immediate interest. The present text is designed especially for those
whose major interests lie in economics and business: statistical method is considered in
the light of its many applications in these two important fields.

The value of statistics in scientific inquiry may be indicated
by a simple concrete illustration of a type of problem in the
solution of which statistical investigation is especially service- 
able. Suppose that a serious traffic situation has developed in 
an urban center owing to the rapid increase of the use of automobiles.
To solve the problems involved, detailed information is
needed — information, for example, regarding conditions on 
particular streets and at certain intersections; the hours of the 
day when congestion is most serious; the types of traffic involved
at stated points. Information of this sort is to be had only after 
a series of steps has been taken. The number and types of motor 
vehicles passing selected points during stated periods must be 
observed and recorded. The data secured by this means must 
be assembled and tabulated on some carefully devised plan. 
The totals, resulting from the tabulations, must be appropriately 
analyzed and interpreted, and the conclusions thus finally arrived 
at presented in some effective fashion. By these means the 
immediate facts of the local traffic situation may be made clear, 
and the way prepared for intelligent action. The problem has 
its solution in findings based almost entirely on statistical inquiry. 

The value of statistics may be further indicated. Administra- 
tive practices, as well as scientific inquiry, rest increasingly on 
statistical method. The point is clearly stated in the following 
quotation:

“The general manager of a great railroad may in some cases spend half 
his time in inspection, but under normal conditions he does not actually 
direct any part of the movement of trains. If he actually saw one per 
cent of his company’s business moved, it would be an occasion for com- 
ment, and so, since the eye of the master is so limited in its vision, other 
means of control for the property must be provided. Roughly speaking, 
the method of accomplishing this is to split all the operations of the road 
into rigid, clearly defined units, and then to compare these units with simi- 
lar ones on other railroads, or with the same railroad at other periods, or 
with arbitrary standards set up like yardsticks. Railroad control through 
statistics might be defined as the process of finding the unit in each opera- 
tion, of seeing to it that these units are rigid, that they are collected and 
reported accurately, promptly, and economically, and then of taking the
necessary measures to correct the defects which they indicate. Statistics
are the clinical thermometer of industry.”1


1 Ray Morris, Railroad Administration, p. 226.

Examination of a large variety of concrete investigations
shows that the value of statistics springs from at least two distinct
contributions. In the first place, statistical inquiry is the only
possible means of apprehending one highly important type of 
phenomenon. This is the phenomenon which inheres in a 
large collection or aggregate of individual cases. The size of 
New York City, the nature of a vice problem, the course of an 
influenza epidemic, the structure of a metropolitan area, the flow 
of foreign trade, the exploitation of natural resources, changes in 
the cost of living, the potentialities of a market — these and 
countless other phenomena are by nature extensive and complex.1
They cannot possibly be encompassed in a single observation. 
Only through the analysis of quantitative data can they be 
known. Statistical method in the study of such collective 
phenomena is indispensable. 

In the second place, statistics renders an incalculable service 
in the analysis of all forms of variation. Variation is one of the 
most universal aspects of experience. Things are infinite in 
variety. Furthermore, they are forever changing. Experience 
never perfectly repeats itself. Knowledge of the universe about 
us has to be extracted from comparisons; from a study of likes 
and unlikes; from an examination of similarities and dissimilarities.
At bottom, exact knowledge rests upon the measurement
and interpretation of differences. Whenever these differences 
find expression in quantitative data, they may be made the sub- 
ject of statistical inquiry. Statistics is an essential means for 
interpreting the world of variables in which we move. 

1 They may be referred to as collective or mass phenomena.

A large part of statistical work takes the form of a splitting up
of differences into their several elements. Thus, the difference
noted in the items of a time series may be resolved into a number
of distinct differences having their origin in the growth of the
factor, its seasonal variation, its fluctuation in response to the
business cycle, its change owing to some cataclysmic force,
and its accidental fluctuation on account of numerous indeterminate
influences. Differences in the original items may be largely
unintelligible, simply because they represent the combined
influence of a number of independent forces. A great deal of
statistical analysis is devoted to the isolation of elements, and the 
separation of the differences which may be ascribed to each of 
several factors. It is by this means that the image of the variable, 
frequently complex and confused in the original data, is ren- 
dered simple and distinct. 
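
The splitting of a time series into its elements may be sketched in present-day terms as follows. The sketch is an illustration only and is not the procedure developed in this text. It assumes an additive model, monthly items, a straight-line growth element fitted by least squares, and a seasonal element measured as the average departure from that line for each calendar month; the cyclical, cataclysmic, and accidental elements are left together in the remainder. All names and figures are hypothetical.

```python
# Hypothetical sketch: additive split of a monthly series into
# trend + seasonal + remainder, assuming a straight-line trend.

def fit_line(values):
    """Least-squares straight line through the points (0, v0), (1, v1), ..."""
    n = len(values)
    mean_t = (n - 1) / 2
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in range(n))
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in enumerate(values))
    slope = sxy / sxx
    intercept = mean_v - slope * mean_t
    return intercept, slope


def decompose(values, period=12):
    """Return (trend, seasonal, remainder) lists that sum to the original items."""
    intercept, slope = fit_line(values)
    trend = [intercept + slope * t for t in range(len(values))]
    detrended = [v - tr for v, tr in zip(values, trend)]
    # Seasonal element: mean departure from trend for each position in the year.
    seasonal_index = []
    for m in range(period):
        same_month = detrended[m::period]
        seasonal_index.append(sum(same_month) / len(same_month))
    seasonal = [seasonal_index[t % period] for t in range(len(values))]
    remainder = [v - tr - s for v, tr, s in zip(values, trend, seasonal)]
    return trend, seasonal, remainder


# Made-up data: steady growth of 2 units a month plus a repeating seasonal swing.
pattern = [5, 3, 0, -2, -4, -5, -5, -3, 0, 2, 4, 5]
series = [100 + 2 * t + pattern[t % 12] for t in range(36)]
trend, seasonal, remainder = decompose(series)
```

By construction the three elements add back to the original items, so that each observed difference is accounted for by one element or another.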

The importance of this particular function of statistical analysis 
can hardly be exaggerated. Most of the material dealt with by 
statistical method in economic and business science has to do 
with subjects which are not susceptible to laboratory experiment. 
The opportunities of complete isolation of forces and their 
manipulation under observation, which constitute the foundation 
of laboratory experiment and thereby the source of the bulk of 
scientific contributions in recent generations, lie beyond the 
reach of those who are charged with responsibility for the direc- 
tion of social or industrial affairs. Careful examination of the 
numerical evidences of actual experience is the most effective 
substitute for actual experiment. Through systematic analysis 
of available records, through careful breaking down of observed 
differences into their several elements, or separation of items so 
as to render differences in particular groups of a more simple 
character, sound and significant inferences may be drawn from 
the numerical evidences. The process falls short of the effec- 
tiveness of laboratory experiment in the demonstration of the 
exact relationship between variable elements, but in many fields 
statistical analysis is the only available means for setting up a 
provisional explanation of the differences which are encountered 
in everyday experience. 

In stressing the possible services of statistics, the limitations 
of the method should not be overlooked. Statistical analysis 
is only applicable to those phenomena which can be expressed in 
some quantitative fashion. Many problems cannot be stated in 
quantitative form. Sometimes such problems may be studied 
in their indirect evidences. At other times they are to be regarded
frankly as lying beyond statistical reach. Furthermore,
the utmost cultivation of statistical evidences often leaves
exceedingly difficult problems of alternative interpretations. 
Despite such limitations, statistics is a method of surpassing 
importance. There is a vast range of phenomena, still un- 
touched, which may be expected in time to yield to statistical 
analysis. 

The specific objectives of statistical inquiry vary widely. The 
purpose of an investigation may be a more exact determination 
of important characteristics of an individual object, or a more 
definite measurement of likenesses and differences in several 
objects. Or statistical analysis may be undertaken to break 
some composite into its several significant components; to ascertain
characteristic movements in time, such as secular trends or
recurrent movements of a cyclical sort; or to disclose and describe
persistent relationships, or possibly actual causal connections, 
among two or more objects or phenomena. The discovery of 
trend and the forecasting of future changes is a subject of never- 
failing statistical interest. Then, too, statistical inquiry is being 
constantly directed toward the discovery of important rates 
or ratios or constants which may be employed in the establish- 
ment of standards or norms of experience or performance. 
Finally, in the detection and definition of “the reign of law in 
the social actions of men,” statistics finds some of its most ab- 
sorbing quests. The immediate objectives of statistical method 
clearly exhibit wide diversity. 

All these objectives, however, have one common element:
statistical inquiry always involves the analysis of quantitative 
differences. Only through the careful measurement, systematic 
comparison, and close interpretation of observed differences 
are dependable statistical results to be obtained. The specific 
objectives of statistical inquiry may vary; but the general 
purpose of all statistical investigation is the analysis and inter- 
pretation of quantitative differences. 

The influence of statistical method is rapidly increasing. A 
number of considerations are responsible for this. In the first 
place, methods of dealing with quantitative records are of increasing
interest because of the growing need of more precise
measurement in human affairs. As man acquires greater control 
over natural forces, and as he finds himself more and more in- 
volved in elaborate social relations, accurate knowledge of 
physical and social forces becomes more and more imperative. 
Much now depends, for example, on accurate records of the
weather, on more complete knowledge of the causes and con- 
sequences of accident and disease. We are vitally concerned 
nowadays with the forces of public opinion, the results of educa- 
tion, the requirements of economic well-being. The increasing 
scope and magnitude of our economic and business arrangements 
rest on a more objective measurement of the various factors 
involved. In such a complex life as ours, there is need of more 
specific information than would be necessary or particularly 
useful in a simpler order of civilization. 

In the second place, the growing significance of statistical 
research is but a natural accompaniment of the spread of the 
modern scientific spirit. There is a general disposition to 
attempt accurate measurement of the elements involved in any 
complex situation, with a view to a more intelligent understanding 
of the inter-relationships of the determining factors. The ex- 
traordinary achievements of scientific method in the study of the 
physical universe have naturally led to a spread of the scientific 
attitude in the investigation of social institutions and human 
affairs. We are no longer content with rough deductions from 
more general truths. We seek constantly to extend the bounds 
of exact scientific knowledge. In this endeavor we inevitably 
encounter quantitative data which can make their contribution 
only through statistical analysis. 

In the third place, statistical method is acquiring greater and 
greater prestige because in some of the fields of inquiry now under 
the influence of the scientific spirit, certain of the methods of 
science are inapplicable. Laboratory experiment, as has been 
noted, can hardly be employed in an investigation of most social 
complexes. There is no opportunity for the isolation of factors 
and the nice manipulation of elements which have contributed so
enormously to scientific achievement in other directions. Conclusions
have to be drawn, instead, from careful analysis of the
data of practical experience in the field. Inferences have to be 
extracted from large masses of quantitative data instead of from 
the more compact records of laboratory experiment. Statistical 
analysis thus comes to play a much more essential part in the 
use of scientific methods for the extension of knowledge. 

It has been stated that statistics has to do with the collection, 
tabulation, and analysis of quantitative data. Corresponding 
to these three elements in the treatment of quantitative data are 
three phases of statistical method. The technique of collecting 
satisfactory data is largely distinct from that of tabulating the 
returns, or of analyzing the aggregates. True, satisfactory 
collection and tabulation are hardly possible unless there is 
thorough appreciation of the problems of analysis. On the other 
hand, it is almost equally true that sound analysis is conditioned 
in substantial measure by full realization of the difficulties of 
collection and tabulation. All phases of statistical method are 
closely related. 

At the same time, the technique of statistics varies funda- 
mentally from phase to phase. Commonly the different lines of 
work are carried on independently. A statistician may be expert 
in handling a large-scale enumeration, and still know relatively 
little of the intricacies of analysis. Conversely, a statistician 
may be thoroughly competent in the mathematics of refined 
statistical analysis, and be woefully lacking in ability to organize 
an actual count of a statistical character. There is in practice 
a fairly definite separation of personnel in the various phases of 
statistical work.2 Presumably, this separation of personnel reflects,
in part at least, differences in the requisites of satisfactory
performance in the various lines of statistical work. Administration
of a great population census, with all its intricate problems
of organization and management, its thousands of workers,
its millions of returns, makes demands upon the statistical
expert quite unlike those involved in a statistical analysis.
Differentiation in statistical work is fundamental, and may well
be recognized in the organization of instruction in the subject.

1 On the relations between statistical and laboratory methods of scientific investigation,
see the illuminating discussion in Raymond Pearl’s Modes of Research in Genetics, chap. III.

2 Government offices commonly concern themselves largely with the assembling of
statistical material. They work out the problems of collection and tabulation, and to a
large extent leave to those who later employ the information the work of final analysis. A
great deal of the work done by academic statisticians, on the other hand, consists of the
analysis of data tabulated and published by agencies with which these statisticians have
had no direct contact.

After all, however, analysis is the basic element in statistical 
method. It is in analysis that the real objectives of statistical 
investigation eventuate. An appreciation of the problems of 
analysis constitutes the beginning of wisdom in statistical work. 
The present text, therefore, with full recognition of the close 
inter-relationships of the three different phases of statistical 
method, will concern itself for the most part with analysis. It
will endeavor to develop in simple and orderly fashion the funda- 
mentals of analysis, giving consideration to problems of collection 
and tabulation only to the extent to which a knowledge of these 
phases of method is necessary to a full understanding of the 
technique of analysis. 


CHAPTER II 


VARIABLES AND STATISTICAL UNITS 


The general purpose of statistical inquiry is the analysis and
interpretation of variables. Knowledge of the nature of variables, 
of their more important forms, and of the units in which they 
are expressed, is necessary at the very outset of statistical 
analysis. These subjects will be dealt with, therefore, in the 
present chapter. 

A. VARIABLES 


Definition of the variable is a simple matter: a variable is 
anything which exhibits differences of magnitude or of number. 
The variable may be either concrete or abstract. It may be an 
attribute of some definite object. It may be a mere ratio. 
The one essential characteristic of the variable is that it exhibit 
discernible differences. 

Examples of variables appear on every side. A familiar vari- 
able is one’s pulse. Other illustrations are one’s weight, 
the statures of one’s acquaintances, the population of the com- 
munity in which one lives, the price of strictly fresh eggs at the 
near-by grocery, the output of the largest local factory, the 
number of failures among the business enterprises of the section, 
the ratio of operating expenses to net revenues among these 
business concerns. Obviously, examples might be multiplied 
indefinitely from factors close at hand or far afield. The world 
is, in fact, one great mass of variables. The general nature of 
the variable is thus readily observable. 

But it is not enough to recognize the general nature of the 
variable. Variables exhibit important differences. These merit 
careful examination. Some of the more significant of the differences
will be brought out if attention is first focused on the
processes through which the records of the variable are made 
available. 

The values of some variables are arrived at through a mere 
tabulation of individual cases. The expanding population of 
the United States is officially recorded once in ten years in a 
nation-wide census under which all the individual inhabitants 
are first systematically reported and then carefully counted. 
The population from one decade to another appears as a variable 
the values of which are obtained from a straightforward counting 
process. 

All sorts of classifications result in variables which are similarly 
obtained by counting individual cases; for complete classification
implies not only a grouping of the individuals of the total “population,”
but an enumeration of the cases falling within each group.
The variable “frequencies” which are thus obtained are obviously
counted variables; they result from simple tabulation.
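
A counted variable of this kind may be illustrated by a short sketch. The returns listed are invented for the purpose, and the class names are hypothetical; the sketch shows only that the frequencies arise from grouping the cases and counting those that fall within each group.

```python
from collections import Counter

# Hypothetical returns: one recorded class for each individual case.
returns = ["farm", "factory", "farm", "office", "factory", "farm"]

# Classification and enumeration in one step: each class with its frequency.
frequencies = Counter(returns)

for group, count in sorted(frequencies.items()):
    print(group, count)
```

The frequencies necessarily sum to the number of cases reported, which affords a simple check on the tabulation.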

The values of other variables, in contrast, are obtained by 
means of specific measurements.1 Atmospheric pressure at
different weather observatories is an illustration. The price of a 
bushel of wheat at the opening of the Chicago wheat pit on the 
first of each month is another example. Original observation in 
such cases involves the determination of a measurable magnitude. 
The records are quantitative ab initio.2 They result from the
process of measurement. 

The records of still other variables are secured through com- 
putations of one sort or another. The daily average production 
of pig iron in the United States by months is a concrete example;
the average yield per acre of potatoes in Michigan by counties 
is another. The characteristic feature of such variables is that 
they are derivatives. They emerge in the results of arithmetic 
or mathematical processes. 

1 Among the elements most commonly measured are size, weight, duration, and value.
2 Tabulated variables, on the other hand, acquire quantitative character only through
the process of counting. See Yule, Introduction to the Theory of Statistics, p. 7.

Some of the most important variables of this type are recorded
in simple averages; the monthly average price of raw cotton at
New York may be cited as an example. Variables of this sort,
stated as averages, have distinct advantages. The price of wheat 
in the Chicago market may vary so widely from day to day as to 
make daily reports rather difficult of interpretation; the monthly
totals of steel output may give rise to mistaken conclusions, owing 
to the variable number of working days in the several months. 
Analysis under these conditions may profit greatly from an initial 
conversion of the daily or monthly quotations into appropriate 
averages. The derived variable is closely related, of course, to 
the original. In between the two, however, lies the process of 
averaging, one of the most important phases of statistical
method. The questions raised by this averaging will be considered
later. It is enough to note here that variables derived as
averages from measured or counted variables sometimes possess 
distinct advantages arising from the averaging process, and are a 
form of variable of growing importance. 
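
The point regarding monthly totals and the variable number of working days may be made concrete by a small sketch. The output figures and the counts of working days are invented for the illustration, and the function name is hypothetical; the sketch assumes only that the average wanted is output per working day.

```python
# Hypothetical figures: monthly output totals (tons) and working days per month.
monthly_output = {"January": 2600, "February": 2400, "March": 2700}
working_days = {"January": 26, "February": 24, "March": 27}


def daily_average(totals, days):
    """Output per working day for each month."""
    return {month: totals[month] / days[month] for month in totals}


averages = daily_average(monthly_output, working_days)
# The monthly totals differ, yet the rate of output per working day is the same.
```

In these made-up figures the totals rise and fall from month to month solely because the months differ in length; the derived daily average removes that difference.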

Still another form of derived variable is obtained by striking 
some sort of relative between two or more variables already re- 
corded. Variables of this kind may be called compound. Illustra- 
tions abound. A water company may keep careful records of both 
the number of water consumers and the amount of water con- 
sumed. From these two it may readily calculate the number of 
gallons consumed per capita per day — obviously a compound 
derived variable. A railroad company may keep accurate records 
of the number of passenger trains operated and of the number of 
passengers carried. From these two, a simple calculation obtains 
the average passenger train-load — another compound derived 
variable. In each of these instances the variable results from a 
combination of two other variables. 
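The two illustrations just given reduce to the division of one recorded variable by another. The following is a minimal sketch in Python; the function name `compound_ratio` and every figure in it are hypothetical, chosen only to show the arithmetic.

```python
def compound_ratio(numerator: float, denominator: float) -> float:
    """Relate one recorded variable to another by simple division."""
    if denominator == 0:
        raise ValueError("the base variable must be non-zero")
    return numerator / denominator

# Hypothetical water-company records for a single day (not from the text).
gallons_consumed = 1_200_000
consumers = 15_000
gallons_per_capita_per_day = compound_ratio(gallons_consumed, consumers)

# Hypothetical railroad records for a single month (not from the text).
passengers_carried = 45_000
passenger_trains = 600
average_train_load = compound_ratio(passengers_carried, passenger_trains)

print(gallons_per_capita_per_day)  # 80.0
print(average_train_load)          # 75.0
```

In each case the derived figure has a compound unit (gallons per consumer per day, passengers per train) that neither original record carries by itself.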

Derived variables of this general type assume many forms. 
One of the most common of these makes allowance for variations 
in the relevant population. Let us take a concrete illustration. 
The relative extent to which the telephone has been adopted as 
a means of communication in the United States and certain 
South American countries is hardly indicated by a statement that 
on January 1, 1919, there were about 12,000,000 telephones in the 
United States, 105,000 in Argentine, 67,000 in Brazil, and 24,000
in Chile; differences of population account for the greater 
part of the differences in the number of telephones used. 
Much more light is thrown on the situation if it is stated that the 
number of telephones per 100 population in these countries was as
follows:

    United States    11.39
    Argentine        [illegible]
    Chile             0.59
    Brazil            0.25


Here allowance has been made for the differences of population. 
Derived variables designed to make such allowances are highly 
serviceable. 
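The allowance for population can be sketched as a small Python function. The name `rate_per` is hypothetical, and the population figure below is an assumed round number supplied only for illustration; the telephone count alone comes from the text.

```python
def rate_per(count: float, population: float, base: int = 100) -> float:
    """Express a count as so many per `base` of the population."""
    return count * base / population

# Telephone count as quoted in the text ("about 12,000,000").
telephones_us = 12_000_000
# Assumed population, a round figure for illustration only.
assumed_population_us = 105_000_000

print(round(rate_per(telephones_us, assumed_population_us), 2))  # 11.43
```

The result is close to, though not identical with, the 11.39 quoted above, since both inputs here are round numbers. Changing `base` to 1000 or 1,000,000 alters only the scale of the figure, not the comparison between countries.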

Sometimes, instead of making reference to the total underlying 
population, reference is made to such numbers of the population as 
are directly concerned in the variable in question. Thus, instead 
of giving the number of unemployed workers per thousand of the 
total population, the number per thousand of the adult wage- 
earning population may be used as the base. In similar fashion, 
the number of births in a population may be related to the number 
of married women of child-bearing age instead of to the total 
number of the population; or the number of bushels of wheat 
produced may be given per thousand of farming population in- 
stead of per thousand of the total population. Rates or ratios 
which are thus given on the base of numbers in selected groups of 
the population, instead of numbers in the population as a whole, 
are referred to as refined, in contrast to the crude rates which 
cite differences per thousand or per million of the total population. 
Refinement of rates in this fashion sometimes adds greatly to
their statistical significance. 
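The contrast between a crude and a refined rate may be seen in a short Python sketch. All figures describe a hypothetical community and are assumptions made for illustration; the function name `rate_per_thousand` is likewise hypothetical.

```python
def rate_per_thousand(cases: float, base_population: float) -> float:
    """Number of cases per thousand of the chosen base population."""
    return cases * 1000 / base_population

# Hypothetical community (every figure assumed for illustration).
unemployed = 1_200
total_population = 100_000
wage_earning_population = 40_000

crude_rate = rate_per_thousand(unemployed, total_population)
refined_rate = rate_per_thousand(unemployed, wage_earning_population)

print(crude_rate)    # 12.0 per thousand of the total population
print(refined_rate)  # 30.0 per thousand of the wage-earning population
```

The numerator is the same in both cases; only the base differs. The refined rate relates the unemployed to the group actually exposed to unemployment.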

Another type of relative which serves the same purpose as those 
just considered, is the common per capita figure. Here, instead 
of using the base of 100, or 1000, or 1,000,000 of the population, 
the base is the single individual. Thus, the amount of sugar 
consumed in different countries is usually stated as so much 
per capita. The relative in this case obviously accomplishes 
the same result as the rates expressing variables per thousand or
per million of population. The difference of the two forms is 
altogether numerical. 

Sometimes the original variable is to be related, not to popu- 
lation, but to some other underlying factor, such as area. For 
example, the total corn crop in the United States may be related 
to the area planted to corn by giving the yield of corn per acre. 
Or, to take another illustration, the population of different 
countries may be given as so much per square mile, thus affording 
an index of the density of population. Relatives on the base of 
units of area are an important form of derived variable. 

In precisely the same fashion an original variable may be re- 
duced to a relative of the unit of time. Thus, monthly cotton 
consumption in the United States may be reduced to average 
daily cotton consumption by dividing through by the number of 
working days in the individual months. Relatives reduced to 
this form are commonly preferable to the original variables from
which they are extracted, since they make proper allowance for 
the differences of elapsed time in the intervals upon which the 
original aggregates are based. 
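The reduction of monthly aggregates to daily averages may be sketched as follows. The monthly totals and the counts of working days are hypothetical figures, chosen so that the effect of the unequal months is plain.

```python
# Hypothetical monthly consumption in bales, with working days per month.
monthly_records = {
    "January": (520_000, 26),
    "February": (480_000, 24),
}

# Divide each monthly total by the number of working days in that month.
daily_average = {
    month: total / working_days
    for month, (total, working_days) in monthly_records.items()
}

print(daily_average)  # {'January': 20000.0, 'February': 20000.0}
```

On the monthly totals, February appears to fall 40,000 bales below January; on the daily averages, the two months are identical. The apparent decline is due wholly to the shorter month.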

Of course, population, area, and time are merely the most 
common bases of reference in these derived variables. In some 
instances, other bases of reference are in order. Thus, the 
number of insurance claims paid by an insurance company may 
be related to the number of thousands of policies in force. The 
variable then appears as the number of insurance claims paid 
per year per one thousand policies in force. Similarly, the 
number of positions available at an employment office may be 
reported with reference to the number of applicants. The 
important consideration in formulating variables of this sort is 
to relate the variable to the most significant underlying factor. 

In railway transportation, a large number of derived variables 
have been introduced, some of them of a more or less complicated 
nature. Thus, we have the total ton-miles per car per day, or the 
ton-miles per engine-hour; the number of car-miles per car- 
day; the number of tons per loaded car; the percentage of 
freight cars loaded to total car-miles; the average length of haul
for passengers and freight respectively; the average receipts 
per passenger-mile and per ton-mile; miles per car-day; the per
cent loaded car-miles; ton-miles per car-month.

In other fields also there is a tendency to devise these com- 
pound variables. In an attempt to measure labor efficiency, 
records may be obtained of the output per man per hour. In 
a financial analysis, close attention may be paid to the per- 
centage of sales for cash to all sales, or to the percentage of total 
receivables to net sales; to the relation of current assets to
current liabilities; or to the ratio of operating profits to total 
capital employed. These and countless other derived variables 
appear in statistical analysis. Their formulation constitutes 
one of the most important phases of statistical investigation. 

Derived variables commonly find expression in relative, not 
absolute, numbers.1 The steps taken in obtaining these numbers
frequently obscure the nature of the original observations upon 
which they rest, and may give rise to statistical results which are 
at times misleading, if not actually in error. Great care has to 
be exercised, consequently, in the statistical treatment and 
interpretation of derived variables. It is of the utmost im- 
portance that they be set up in precisely the right form. The 
full significance of this will appear when the technique of statis- 
tical analysis is more fully developed later in the present study. 
Meanwhile, the distinction between original and derived vari-
ables may well be emphasized, so that derived variables may be
quickly identified and a habit of caution cultivated in their 
analysis. 

Certain distinctions of form in the variable have a fundamental 
bearing upon subsequent analysis. One of the most important 
of these distinctions has reference to the continuity of variation. 
On this basis, two forms of variables are to be recognized:
(1) continuous, and (2) discrete.


1 Derived variables expressed in absolute numbers are not unusual, however. The 
increase of the population of the United States by decades in thousands is an example. A 
number of the derived variables cited in the preceding paragraphs are also in the form of 
absolute numbers. 
Continuous variables may assume within the limits of their
range any value whatsoever. No precision of measurement 
ever effects more than an approximation to the actual magnitude 
of the variable. Tonnage of vessels is a continuous variable. 
So is the temperature of a room. However accurately we may 
attempt to measure these, we know that further refinements of 
measurement are possible and that any statement of magnitude 
is of the nature of an approximation. 

In contrast to variables of this sort are those which can assume 
only certain positions on the scale. Size of family is a variable 
of this sort. There may be families of two, three, four, or more 
members, but naturally no families showing fractional parts. 
The number of passengers carried on successive trips of a motor 
bus is another variable of this kind. Careful regard for complete 
accuracy of observation may yield a precise determination of the 
size of such a variable. Measurement is not of necessity in the 
nature of an approximation. The variable can assume only 
certain clearly defined positions on the scale and these positions 
may be exactly noted; in between these positions on the scale 
no items can possibly occur. A variable of this type is referred 
to as discrete, or discontinuous. 

Many variables, though discrete in character, may be fairly 
dealt with as if continuous, for the reason that the variables run 
to subdivisions of the scale which are so small in comparison with 
the general size of the variable that, for all practical purposes, 
the condition of continuity is attained. The population of 
municipalities is an illustration of this case. The variable cannot 
occur in fractional parts, and hence is logically discrete in char-
acter. In other words, with due care in observation, the popula-
tion can be exactly determined; measurement is not of necessity
in the nature of approximation. Despite this fact, the variable 
presents a condition which is much like that of continuity. 
The magnitudes assumed by municipal populations are so great 
compared with the smallest possible differences in size (namely,
differences of a single individual) that no material error is
involved in analysis if the variable is treated as if continuous. 
A great many variables, the records of which are obtained by
counting, are to be dealt with in this fashion. Since these are 
expressed as aggregates of individual cases, they are as a rule 
logically discrete. At the same time they commonly assume 
proportions so large that differences of one or two, or even more, 
are very small relatively. The variables though discrete in 
principle may be analyzed as if continuous. 
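The argument turns on the size of the smallest possible step relative to the ordinary size of the variable. A minimal Python sketch of that comparison follows; the function name `relative_step` and the typical sizes are assumptions made for illustration.

```python
def relative_step(unit_step: float, typical_size: float) -> float:
    """Smallest possible difference as a fraction of the typical magnitude."""
    return unit_step / typical_size

# Size of family: a step of one member in a family of about four.
family_step = relative_step(1, 4)

# Municipal population: a step of one person in an assumed city of 250,000.
city_step = relative_step(1, 250_000)

print(family_step)  # 0.25
print(city_step)    # 4e-06
```

A step amounting to a quarter of the whole cannot be ignored, and size of family must be handled as discrete. A step of four parts in a million is immaterial, and the population figure may be treated as if continuous.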

Another large group of variables, expressed in terms of the 
monetary unit, exhibit the same condition. These are discrete 
in principle for the reason that no subdivision of the smallest 
monetary unit is possible. For example, it is impossible that a 
man’s actual personal income should run to fractional parts of a 
cent, since payment cannot be made in smaller subdivisions of a 
dollar than one cent.1 Nevertheless, if incomes run to large sums,
the single cent is so small a part of the ordinary size of item that 
the condition of continuity is attained for all practical purposes. 
As in the other type of case cited, the variable is logically discrete, 
but may be dealt with as if continuous.2

Every variable, whatever its form, involves the use of a unit. 
The unit may be either concrete or abstract, either simple or 
compound; but there must be a unit of some sort if the magni-
tudes of the variable are to be intelligible. Light will be thrown
upon the nature of the variable, therefore, if the qualities of 
statistical units are briefly considered.3


B. STATISTICAL UNITS


The unit in which many variables are expressed is the individual 
case. If a census of the population of New York City is taken 


1 It should be noted that not all variables quoted in the monetary unit are logically dis-
crete. Some are unmistakably continuous. Thus, though the actual aggregate income of a
number of groups of industrial workers is a discrete variable, the average individual income 
in these groups is a continuous variable. Clearly, there is nothing to prevent the calculation 
of this average to any degree of nominal precision, and the averages may assume any posi- 
tions whatsoever within the limits of their range. 

2 At times the assumption of continuity is not only permissible but desirable. The
advantages of the assumption will appear later (see Chapter X, for example). It is to be 
recognized at all times, however, that variables in the forms cited are logically discrete, even 
if for purposes of practical analysis they may be dealt with as if continuous. 

3 See G. P. Watkins, “The Relation between Kinds of Statistical Units and the Quality
of Statistical Material,” Quarterly Journal of Economics (Aug. 1912), vol. XXVI, pp. 
673-702. 
at stated intervals, the individual resident is the unit in which
the variable population is stated. Similarly, if a classification 
by size is made of farms in the United States, the variable number 
of farms in the several classes is expressed in terms of the in- 
dividual farm as a unit. A great many variables in economics 
and business are stated thus in units of the individual case.1

Among units of this sort, three varieties may be recognized:
(1) natural objects, such as the horse, given as to general
form and character by nature; (2) concrete produced things,
such as the house, material and tangible but having no definite
form set by nature; (3) institutional objects, such as the
citizen, having existence only in connection with some political,
economic, or social process, or institution, or practice.

All subsequent statistical analysis depends upon satisfactory 
definition of these units consisting of individual objects. If 
definitions are vague or inexact, things which are unlike for the 
inquiry in question will be dealt with as if like. It is clearly of 
the utmost importance that the units remain essentially the same 
throughout any given investigation. This requires simple, unam- 
biguous, fixed definition of the object serving as unit. A brief 
consideration at this point of the problems involved in such 
definition will be instructive. 

The task of definition is simplest in the case of well-known 
natural objects. Thus, if a census of cattle in the United States 
is to be undertaken, the object of inquiry has been previously 
defined in terms which are readily applicable. True, there will 
have to be consideration of the question of excluding young stock;
but aside from this, no serious problem of definition arises. 
Natural objects are ordinarily recognizable without confusion. 
The counting of well-known natural objects is the simplest case 
of statistical observation. 

Produced objects are not so easily distinguished. In what 
terms, for example, is the farm to be defined? What of the


1 Not infrequently it is convenient to express large aggregates in multiples of the in- 
dividual case; e.g. cotton machinery in millions of spindles, or automobile production in 
thousands of cars. (See Table 29 for illustration.) It is preferable in such cases not to use 
intermediate multiples, such as tens or hundreds of thousands. 
garden or pasture of the villager? What of the orchard of the
suburbanite? In a standard dictionary we may find the farm 
defined as “any tract of land devoted to agricultural purposes 
under the management of a tenant or the owner.” The 1920 
Federal Census defines the farm as “all the land which is directly
farmed by one person managing and conducting agricultural
operations either by his own labor alone or with the assistance of
members of his household or hired employes.” The differences
between these definitions are instructive. One is designed to
convey a general idea; the other is so set up as to avoid difficulties
in the actual observation of specified cases. Evidently, defini- 
tion of a produced object, even one of a simple sort, requires 
much thought, an exact use of words, and a careful regard for 
practical considerations. 

The distinction between natural and produced objects is not 
always perfectly clear. The character of even so-called natural 
objects may be considerably affected by the arts of production. 
Thus, progress in the breeding of cattle may considerably im- 
prove the stock. The qualities of whole milk may be consider- 
ably affected through the careful selection of dairy stock, or the 
shifting of herds from one breed of cattle to another. Whole 
milk as delivered within city limits to-day is quite different from 
whole milk as delivered within city limits twenty years ago. 
Formerly, the milk was commonly sold in bulk; now it is almost
universally bottled. Formerly, there were few recognized 
standards of cleanliness. Now great care is taken to keep the 
milk clean and wholesome. A marketed quart of milk to-day 
goes by the same name as formerly, but it is in many respects 
altered. It is an object which is partly natural and partly 
produced. 

The objects enumerated in many statistical inquiries are 
neither “natural” nor “produced,” but come into existence in
connection with some political, economic, or social institution or 
process. The definition of such apparently obvious things as the 
citizen, or the pauper, or the immigrant, or the day-laborer may 
present serious difficulties. Take the case of the immigrant. 
Is an alien in the country for temporary employment an immi-
grant? Is an alien coming into the country for extended travel 
to be considered an immigrant? Is a minor child admitted to 
the country with its parents, whose stay is problematical, to be 
regarded as an immigrant? These and many other questions of 
similar sort have to be answered in determining what shall be 
considered an immigrant for the purpose of statistical inquiry. 
Similarly, definition of the citizen, the pauper, the criminal, 
the employer, the banker, the patriot, the manufacturing estab- 
lishment, the city, presents serious difficulties. Countless other 
objects, not of the natural and produced sort, present in varying 
degree similar problems of definition. 

The difficulties of definition multiply rapidly when explicit 
qualifications are introduced. Suppose a count is to be made of 
sailing vessels in the American merchant marine. Under what 
conditions is a vessel to be regarded as a sailing vessel? Is it 
such if it carries auxiliary power, say in the form of a Diesel 
engine? The answer presumably must be made with due regard 
to the extent to which this auxiliary power is employed. If the 
power is utilized only when conditions of wind and weather make 
dependence upon sails ineffectual or unprofitable, the vessel 
may still be classified as a sailing vessel. But possibly the motor 
power is used constantly and the sails employed as accessories. 
The situation manifestly is one in which definition is not at all 
easy. Much has to be known of the customary operation of the 
vessel to make the classification of the vessel significant, and it is 
not at all unlikely that the same vessel may well be variously 
classified at different times. 

Similar questions are raised by a multitude of other cases. 
Citation of a few will suffice. What is a boarding-house; a 
department store; a traveling salesman; an educated man; an 
unemployed worker; a business corporation; a freight locomo- 
tive? Clearly, explicit qualifications attaching to an object 
increase substantially the difficulties of definition, and make 
more than ever imperative the task of setting up exactly the unit 
of inquiry. 
Not all variables are stated in units of the individual case.
Many are expressed in terms of some simple unit of measurement. 
Such a unit may be either concrete or abstract. The former 
variety may be considered first. 

Simple concrete units of measurement may be divided into the 
physical and the pecuniary. Among the physical units are many 
of those most commonly employed in everyday life, such as the 
foot, pound, bushel, acre, minute. Most of these, employed 
universally for purposes other than statistical analysis, have 
already been well defined. They present fewer complications 
than any other type of unit of measurement in the statement of 
the magnitudes of variables. 

Some physical units, however, raise considerable statistical 
problems. Thus, the ton as a unit of weight is not employed 
uniformly. Short tons, long tons, metric tons, are all in common 
use. When ship tons (net, gross, and deadweight) as units of 
measurement of ship capacity are added to the list, it is readily 
seen that the ton as a unit of measurement is distinctly am- 
biguous. Such divergences in the meaning of units are multiplied 
when the systems of different countries are brought together. 

Such divergences introduce a complication which should be care- 
fully noted in each instance and dealt with as the requirements of 
investigation may indicate. 
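Where records in the several tons of weight must be brought together, they may first be reduced to a common unit. The sketch below is in Python and assumes the customary definitions (short ton of 2,000 pounds, long ton of 2,240 pounds, metric ton of 1,000 kilograms) together with the modern international pound, which is not given in the text. The names `KG_PER_TON` and `convert_tons` are hypothetical. Ship tons are measures of capacity rather than weight and are deliberately left out.

```python
# Kilograms in one pound under the modern international definition
# (an assumption; the text gives no conversion factors).
KG_PER_POUND = 0.45359237

# Weight of each kind of ton, reduced to kilograms.
KG_PER_TON = {
    "short": 2000 * KG_PER_POUND,
    "long": 2240 * KG_PER_POUND,
    "metric": 1000.0,
}

def convert_tons(quantity: float, from_unit: str, to_unit: str) -> float:
    """Convert a weight stated in one kind of ton into another kind."""
    return quantity * KG_PER_TON[from_unit] / KG_PER_TON[to_unit]

print(round(convert_tons(100, "long", "short"), 2))   # 112.0
print(round(convert_tons(1, "metric", "short"), 4))   # 1.1023
```

A record of 100 tons thus differs by 12 per cent according to whether long or short tons are meant, which is ample reason for noting the unit in each instance.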

Pecuniary units serve in the measurement or statement of 
values. They are customarily defined in the laws of each country. 
Thus, the dollar is a coin containing 25.8 grains of gold nine-tenths 
fine. The pound, the franc, the mark, the lire, are other widely 
used pecuniary units. 

Pecuniary measures obviously lack uniformity. In the first 
place they differ from country to country. This is true even when 
formal definitions alone are considered. When relative pur- 
chasing power, or value, is analyzed, wide divergences may be 
disclosed in units which may be nominally equivalent. Further- 
more, when values at different times are examined, even greater 
disparities are to be noted. The purchasing power of a unit 
may undergo substantial changes in comparatively short periods 
of time; and in longer periods may be revolutionized. The
dollar of 1920 was very different from the dollar of 1914. Pecun- 
iary units have always been subject to such disparity, and will 
probably remain so in considerable measure despite efforts to 
stabilize their values. This variable significance should always 
be borne in mind in dealing with pecuniary units in statistical 
observation and analysis. 
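One common way of allowing for the changing significance of the pecuniary unit is to restate values in the dollars of a base year by means of a price index. The Python sketch below illustrates the arithmetic only. The index values are purely hypothetical (the text gives none), and the function name `in_base_year_dollars` is an assumption.

```python
def in_base_year_dollars(nominal: float, index: float,
                         base_index: float = 100) -> float:
    """Restate a money value in units of base-year purchasing power."""
    return nominal * base_index / index

# Hypothetical price index, with 1914 taken as the base year.
# These values are assumed for illustration, not historical figures.
price_index = {1914: 100, 1920: 200}

income_1920 = 1_000  # dollars of 1920, hypothetical
print(in_base_year_dollars(income_1920, price_index[1920]))  # 500.0
```

Under these assumed figures, a thousand dollars of 1920 commands only what five hundred dollars commanded in 1914. The nominal unit is the same; its significance is not.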

Simple concrete units do not always suffice in the measurement 
of variables; compound units are necessary in some kinds of 
observation. Take, for example, an investigation having to do 
with performances in some particular line of activity or service. 
It may be desirable to analyze the work done by laborers on a 
given contract, or by operatives on a given machine, or by stu- 
dents on a given assignment, or by freight trains on a given run, 
or by a dynamo in an electric power plant over given periods of 
time. Analyses along such lines obviously involve measure- 
ment; but performance is hardly to be gauged in terms of any 
simple physical or pecuniary unit. Combinations of simple 
units are required. Thus, physical units of mass or weight may 
be combined with units of time or distance or space. Examples 
of such composite units of measurement are the ton-mile and the 
passenger-mile, the engine-hour and the machine-hour, the labor- 
day (employment of one laborer for one day of eight hours), 
the kilowatt-hour. Clearly, these compound units are somewhat 
artificially devised. Furthermore, they are by no means stand- 
ardized in many lines of investigation. They are nevertheless 
indispensable instruments of accurate observation of perform- 
ances where the extent or scope of the task or assignment is not 
set or standardized.1
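The construction of a compound unit such as the ton-mile may be sketched in Python. The shipments listed are hypothetical; each is recorded as a weight in tons and a distance in miles.

```python
# Hypothetical freight shipments: (tons carried, miles hauled).
shipments = [(40, 150), (20, 300)]

# Each shipment contributes tons multiplied by miles.
ton_miles = sum(tons * miles for tons, miles in shipments)
total_tons = sum(tons for tons, _ in shipments)

# Average length of haul: ton-miles divided by tons carried.
average_haul = ton_miles / total_tons

print(ton_miles)     # 12000
print(average_haul)  # 200.0
```

Neither tons alone (60) nor miles alone (450) measures the work done. The two shipments differ in both respects, yet each represents the same performance of 6,000 ton-miles.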

Abstract units of measurement are by no means as common as 
concrete units, but occur in many statistical analyses. For 
example, an index number of price movements is likely to be 
stated in units of one per cent above and below some base number. 


1 Where the task has been standardized in some way, the performance may be indicated 
by elapsed time. Thus, the performance of sprinters may be sufficiently indicated by stating 
their records in seconds in the hundred-yard dash, which is, of course, one of the recognized 
events in this field of sport. 
Interest rates similarly are given in units of one per cent. Other
more complicated abstract units are frequently derived in the 
later stages of statistical analysis. In general, abstract units 
are by nature fixed, and present no serious problems of definition. 
They are not always self-evident, however, and are to be care- 
fully noted when encountered. 
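The abstract unit of an index number may be illustrated by a brief Python sketch. The prices are hypothetical, stated in cents, and the choice of 1913 as base year is an assumption made only for the example.

```python
def index_number(value: float, base_value: float) -> float:
    """State a value as a percentage of the base-period value."""
    return value * 100 / base_value

# Hypothetical price of one commodity, in cents, by year.
prices_in_cents = {1913: 80, 1914: 84, 1915: 92}
base_year = 1913  # assumed base period

index = {
    year: index_number(price, prices_in_cents[base_year])
    for year, price in prices_in_cents.items()
}

print(index)  # {1913: 100.0, 1914: 105.0, 1915: 115.0}
```

The resulting figures carry no concrete unit. A reading of 115 means fifteen units of one per cent above the base, whatever the commodity or currency in which the original prices were quoted.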


Variables emerge under diverse conditions. Thus, a variable 
element may appear in a single object at different times.1 The
body temperature of a hospital patient from hour to hour is a 
variable of this sort; the efficiency of a factory operative from 
day to day is another; the net profit of a business enterprise 
from year to year, another. Some of the most interesting 
analyses in statistical work have to do with variables appearing 
under this relatively simple condition. 

Another condition of somewhat similar sort is that of the 
variable element in a single object at different places.2 Atmospheric 
pressure at different weather-stations at a given hour is a variable 
of this type; so also is the price of a dozen strictly fresh eggs at 
different points in the United States on a designated day. The 
“individual object” in cases of this sort is, of course, such only 
in a generic sense; the numerous pressures really relate to dif-
ferent bodies of atmosphere; the various prices to different dozens
of eggs. But the different individuals in these cases are to be 
regarded as identical with one another. For the purpose of 
analysis they are held to be essentially indistinguishable.3 The
variable element may therefore be thought of as appearing in a 
single object at different places. 


1 It is to be noted that the single object may be collective in character. The population
of the United States, for example, may be regarded as an individual entity. 

2 It is to be noted that time and space are themselves variables. True, they are not
given explicitly in simple magnitudes. But if some point of reference is taken, any given 
moment in time or position in space, may be expressed in terms of a stated period, or distance, 
in a given direction from the point of reference. Considered in this way, time and space 
clearly exhibit differences of magnitude. They are therefore to be regarded in statistical 
analysis as simple variables. 

3 This line of reasoning applies to some cases of variation in a single object at different 
times.
Still another condition is that of the variable expressed in
repeated measurements of the same object when time and place
have no influence upon the records. The surveyor engaged in 
triangulation may take repeated readings of the angle between 
two distant points. However great the care he may exercise 
to free his observations from outside disturbances and to make his 
readings accurate, the results will vary from one observation to 
another. Variables in this form are common in natural science, 
but rarely encountered in economics and business. However, 
they are of great theoretical importance. 

A closely related case of variation appears in the differing 
magnitudes of a single observed element in a number of related 
objects held to be unaffected by time and place. The hat sizes 
of customers in a man’s clothing shop differ at any given time; 
so do the customers’ purchasing powers. Variables emerge in 
this way in a large number of economic and business analyses. 

Many variables appear as differing frequencies of occurrence — 
that is, numbers of cases falling within the different subdivisions 
of a classification. Thus, if the individuals of an entire com- 
munity are classified according to their personal incomes, the 
number of individuals falling within the several classes is a 
variable. Every classification, whatever its basis, results in the 
formation of groups of cases. Since the number is bound to 
vary from group to group, a variable results. 
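The emergence of a variable from a classification may be sketched in Python. The incomes, the class limits, and the class labels are all hypothetical, chosen only to show how the counting proceeds.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical lower limits of the second, third, and fourth classes.
class_limits = [1000, 2000, 5000]
class_labels = ["under 1000", "1000-1999", "2000-4999", "5000 and over"]

# Hypothetical personal incomes, in dollars.
incomes = [850, 1200, 1900, 2500, 640, 5200, 1500, 3100]

# bisect_right gives the position of the class into which each income falls;
# an income exactly at a limit is placed in the class beginning at that limit.
frequencies = Counter(
    class_labels[bisect_right(class_limits, income)] for income in incomes
)

for label in class_labels:
    print(label, frequencies[label])
```

The counts obtained (2, 3, 2, and 1 for the four classes in order) are themselves the magnitudes of a new variable, the frequency of occurrence, which differs from class to class.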

The characteristics of variables and the ways in which the 
records of variables are to be analyzed are related fundamentally 
to the several conditions under which variables arise. Later 
chapters will deal with these matters at length. But further 
valuable light may be thrown upon both variables and statistical 
units if attention is first directed toward the means by which the 
initial records of an original variable are obtained. This leads 
to the subject of original observation. It is with this subject 
that the next chapter is concerned. 
ORIGINAL OBSERVATION


The quantitative data with which statistical method has to
do have their origin in individual observations. It is important, 
therefore, at the outset of the present work, to examine the con- 
ditions governing observation, especially those conditions which 
are to be taken into account in subsequent statistical analysis.1


A. OBSERVATIONAL LIMITS 


The most fundamental of the conditions necessary to satis- 
factory statistical observation is the precise delimitation of the 
scope of the investigation — in technical phraseology, the defini- 
tion of the universe of inquiry. There must be precise indica- 
tion of the ground to be covered. Some material must be in- 
cluded, other material excluded; otherwise the investigation can 
have no significance. Determination of the observational limits 
is a fundamental step in statistical inquiry. 

The process of inclusion and exclusion in statistical observa- 
tion always implies a careful definition of three things: (1) limits 
of attribute, (2) limits of space, and (3) limits of time, within 
which objects are to qualify for inclusion in the “universe of 
inquiry.” These three lines of specification require careful 
examination. 

1 Individual observations are made in two distinct types of statistical inquiry. In the
first of these, observations are made as of a specified time; in the second, continuously over
a period of considerable duration. Thus, the population of the United States is counted
as of given census dates (June 1, 1900; April 15, 1910; January 1, 1920); births in the United
States, in contrast, are reported and recorded as they occur. The former of these types
of inquiry is commonly called enumeration; the latter, registration. In most respects, the
requisites of enumeration and registration are identical. The two will be considered to-
gether for the present.
The most fundamental of the three is that relating to the nature
of the attributes in terms of which the objects of inquiry are to be 
observed. Thus, if a study is to be made of creamery butter 
in the United States, creamery butter must be defined; if a 
count is to be taken of the overseas merchant marine of the United 
States, the merchant ship must be exactly described; if a census 
is to be taken of manufacturing concerns in New York State, 
the nature of a manufacturing establishment must be unmis-
takably indicated. Accurate definition is at the very foundation
of statistical inquiry.1

Definition of the scope of statistical observation in terms of 
specified attributes is governed largely by practical considera-
tions.2 In the case of the investigation of the merchant marine,
for example, a distinction presumably is to be made between
merchant and other shipping. This distinction may rest partly
on considerations of the size of the vessels, partly on considera- 
tions of employment. A line is to be drawn also between coast- 
wise and overseas services. Here the distinction is presumably 
to be based on considerations of national boundaries. A vessel 
trading between two ports in the same country is commonly said 
to be in coastwise shipping, though it may travel the high seas 
in making the voyage from one port to the other. A vessel 
trading between two ports in different countries, though it may 
ply strictly along the coast, is classified as in overseas trade. 
Such is the common usage of terms in this particular case. In 
general, definitions should be so framed that the final results may 
commend themselves to those best qualified to pass upon the 
significant facts of the situation. 

Whatever may be the lines to be drawn in the particular case, 
careful specifications are necessary from the start. In particular, 
consistency of practice is here, as elsewhere in statistical work, 
of the utmost importance. Like cases must be uniformly dealt 
with. The thing or things to be included must be accurately 
described in relatively simple and unmistakable terms, terms 
which assure consistent treatment throughout the observation 
of all items falling properly within the range of inquiry. 

1 The consideration of statistical units has already given some indication of this. 
2 Some of these considerations are discussed in Appendix A. 

The second main line of delimitation required in statistical 
observation relates to the spatial boundaries of the count. Those 
in charge of the inquiry have to determine at the outset over what 
area items are to be included. Sometimes the determination 
of the boundaries is a difficult task, but commonly spatial limits 
have been set up in other connections; as, for example, in the 
definition of the boundaries of continental United States in a 
census of the population. Generally, the specification of the 
area for which the count is to be made does not involve serious 
complications; but the matter is one which must be carefully 
covered in arranging for the enumeration. 

The third sort of delimitation necessary to statistical observa- 
tion concerns the time element. Simple enumerations are of a 
given time. Certain limits of time are set up in advance and a 
count is made of the cases falling within these limits. Thus, if 
we are taking a census of the inhabitants of the United States, 
we may count all on January 1, 1920. Any inhabitant living at 
any time during the twenty-four hours constituting January 1 
is counted. In other words, infants born in the morning and 
aged persons dying in the evening of January 1 are all to be 
included. No distinctions are to be made as to the different 
hours of the period of record. All cases occurring at any time 
within this period are to be definitely reported. 

The period of record may be short or long to suit the condi- 
tions of the inquiry. Thus, instead of making the census count 
as of a given day, it may be made as of a given moment, say 
one minute after midnight on January 1, 1920; or the period may 
be lengthened, as in a census of manufactures, to include all 
establishments doing business at any time during the calendar 
year. The important point is that in the case of enumeration, 
the definite limits of time must be established so that cases may 
be uniformly dealt with and the meaning of the count accurately 
known. Within the limits of the time of record no distinctions 
as to time of occurrence of the individual cases are recognized. 

Accurate observation, even under the simplest conditions, 
rests fundamentally, then, upon careful specification of the limits 
of inquiry. Lines of delimitation are necessary in three direc- 
tions: (1) the attribute or combination of attributes in terms 
of which the object of inquiry is to be apprehended; (2) the area 
over which the object thus defined is to be enumerated; (3) the 
period during which the object is to be counted. All three of 
these must be covered accurately, explicitly, and consistently 
if the record of the original observations is to take satisfactory 
form. 
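
The three lines of delimitation may be illustrated by a short sketch. The record fields, the sample records, and the limits chosen below are hypothetical; they serve only to show that an item enters the count when, and only when, it satisfies the specifications of attribute, area, and period together.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Record:
    kind: str        # attribute in terms of which the object is defined
    state: str       # area in which the object was found
    observed: date   # date to which the observation refers


def within_limits(rec, kind, states, start, end):
    """True when a record satisfies the attribute, area, and period limits."""
    return rec.kind == kind and rec.state in states and start <= rec.observed <= end


# Hypothetical returns; only the first falls within all three limits.
records = [
    Record("creamery", "New York", date(1920, 3, 1)),
    Record("creamery", "Ontario", date(1920, 3, 1)),     # outside the area
    Record("dairy farm", "New York", date(1920, 3, 1)),  # wrong attribute
    Record("creamery", "New York", date(1921, 1, 5)),    # outside the period
]

counted = [
    r for r in records
    if within_limits(r, "creamery", {"New York"}, date(1920, 1, 1), date(1920, 12, 31))
]
print(len(counted))
```

Because the same rule is applied to every record, like cases are necessarily dealt with alike, which is the consistency the text requires.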


B. DIFFERENTIATION 


The three lines of delimitation noted above may be paralleled 
in the original observations by three lines of differentiation. In 
other words, the individual objects included in the count may be 
distinguished with reference to differences in their (1) attributes; 
(2) geographic locations; and (3) times of occurrence. Let us 
consider in turn each one of these possible phases of differ- 
entiation. 

Differentiation of the individual items according to specified 
attributes may involve at the time of observation a simple classi- 
fication of the individual cases. Thus, in a complete census of 
all live stock, the animals may be classified as horses, mules, 
cattle, sheep, hogs, etc.; or, in a count of retail stores, the shops 
may be classified as shoe, millinery, clothing, hardware, drug, 
etc.; or, in a registration of immigrants, the aliens may be 
classified as German, French, British, etc. Such classification 
sometimes raises serious questions. Thus, in distinguishing 
aliens as to nationality, there may be doubt as to whether to 
classify all those coming from the German Empire as Germans, 
or to consider some as Bavarians, some as Prussians, some as 
Saxons. There is no more difficult art in statistical work than 
that of significant and satisfactory classification. Later the 
subject will be dealt with at length. It is sufficient for the present 
to note that classifications are not infrequently made in connec- 
tion with the original observations. 

Differentiation as to attribute may go beyond the point of 
simple classification and undertake a recording of the size, or 
degree, or intensity of some variable attribute in the individual 
objects. Thus, in connection with the study of the character 
of immigration, an analysis may be undertaken of the age com- 
position of the aliens admitted. In this case, the age of each 
individual must be ascertained and made a matter of record. 
Here the single attribute age is differentiated and explicitly 
measured. Observation undertaking such differentiation as to 
size, or degree, of a measurable attribute, is the basis of many 
important statistical investigations. 

When measurement is undertaken in connection with original 
observation, a number of new problems arise. For example, 
there is the question of the unit of measurement to be employed. 
Ordinarily, units already recognized in other connections may be 
adopted (see Chapter II), but not infrequently satisfactory 
measurement requires the development of new grading tests, 
of special scales of measurement. Some of the most interesting 
developments in statistical inquiry during recent years have 
involved the construction of such new scales and the develop- 
ment of special grading tests from which quantitative measure- 
ments can be obtained of elements previously regarded as not 
susceptible to quantitative observation. This development has 
been notable in the case of educational investigation, and in the 
application of psychological tests in various lines of social and 
industrial research. Doubtless much is still to be done before 
the development of such special grading tests and scales can be 
regarded as thoroughly scientific; but the movement is in the 
right direction. Everything possible should be done to expedite 
the development of satisfactory units for use in the determination 
of the size or intensity of statistical variables. 

Specific measurement of any attribute always raises questions 
regarding the accuracy which is to be sought in the observations. 
Thus, in attempting to obtain records of the ages of each alien 
inhabitant of the United States, it is necessary to decide in 
advance whether to attempt to secure from each person the exact 
date of birth, or the age in years and months, or the age in years 
as fixed by the last birthday or the nearest birthday. It is 
rarely desirable to strive for the utmost possible accuracy. 
Reasonable approximation will ordinarily serve in any probable 
subsequent analysis of the returns, and will save time and money 
in the original observations themselves. Undoubtedly, numer- 
ous factors affect the accuracy actually attained in the measure- 
ment of specified attributes. Some of these factors, mentioned 
elsewhere, relate to the problems of observation as a whole and 
not specifically to measurement. But it should not be forgotten 
that one of the phases of the problem of measurement in original 
observation is specification of the accuracy which is to be sought 
in carrying through the investigation. 
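
The difference between age at the last birthday and age at the nearest birthday can be made concrete. The following is a minimal sketch, not a prescribed procedure; the function names, the treatment of February 29 birthdays, and the rule for ties are assumptions introduced for illustration.

```python
from datetime import date


def birthday_in(born, year):
    """Birthday falling in a given year; February 29 falls back to February 28."""
    try:
        return born.replace(year=year)
    except ValueError:
        return born.replace(year=year, day=28)


def age_last_birthday(born, on):
    """Age in completed years on the reference date."""
    years = on.year - born.year
    if on < birthday_in(born, on.year):
        years -= 1
    return years


def age_nearest_birthday(born, on):
    """Age at whichever birthday, last or next, lies closer to the reference date."""
    last = age_last_birthday(born, on)
    last_bd = birthday_in(born, born.year + last)
    next_bd = birthday_in(born, born.year + last + 1)
    # Assumption: ties are resolved in favor of the last birthday.
    return last if (on - last_bd) <= (next_bd - on) else last + 1


census_day = date(1920, 1, 1)
born = date(1890, 3, 1)
print(age_last_birthday(born, census_day), age_nearest_birthday(born, census_day))
```

The two rules give different returns for the same person whenever the next birthday is nearer than the last, so the rule adopted must be fixed before the observations begin.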

In addition to differentiation as to attributes, there may be in 
the original observations differentiation as to the geographic 
location of the items. This form of differentiation is an inherent 
feature of all large-scale enumerations. This follows from the 
necessity of accurate administration of the count. It is quite 
impossible for any single observer to cover all the cases of a 
large-scale count. Several observers (often very many, some- 
times thousands) may have to cooperate in covering all the 
cases within the inquiry. It is necessary to see that the observers 
cover the ground completely without any overlapping. There 
must be no trespassing. Duplications and omissions alike must 
be avoided. This necessitates accurate definition of the bound- 
aries of the area over which each observer is to operate. The 
administrative necessity of carving up the territory of the count 
is supported furthermore by the interest which commonly inheres 
in the figures for particular sections of the total area. Thus, in a 
population count of the United States, it is necessary for political 
reasons to secure counts of the population for a large number of 
small political subdivisions. There is reason, therefore, for 
reporting individual items in such a way as to indicate definitely 
the location of their occurrence. In every considerable count 
there must be specification of the detail with which the location 
of individual cases is to be reported. This line of differentia- 
tion is therefore a general characteristic of statistical enumeration. 
Its precise relationship to the requisites of original observation 
should be carefully noted. 
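
The requirement that observers cover the ground completely and without overlapping lends itself to a simple check. The sketch below assumes a hypothetical list of districts and a hypothetical assignment of districts to observers; it reports the districts omitted and those assigned more than once.

```python
from collections import Counter


def check_coverage(assignments, all_districts):
    """Return (omitted, duplicated) districts for a set of observer assignments.

    `assignments` maps each observer to the list of districts that observer covers.
    """
    covered = Counter(d for districts in assignments.values() for d in districts)
    omitted = sorted(set(all_districts) - set(covered))
    duplicated = sorted(d for d, n in covered.items() if n > 1)
    return omitted, duplicated


# Hypothetical districts and assignments.
districts = ["Ward 1", "Ward 2", "Ward 3", "Ward 4"]
assignments = {
    "observer A": ["Ward 1", "Ward 2"],
    "observer B": ["Ward 2", "Ward 3"],   # Ward 2 is assigned twice
}

omitted, duplicated = check_coverage(assignments, districts)
print(omitted, duplicated)
```

An assignment is satisfactory only when both lists are empty, that is, when there are neither omissions nor duplications.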

Differentiation of the individual objects according to the time 
of their occurrence is a third form. As previously indicated, a 
distinction is to be recognized here between simple enumeration 
and continuous registration. It is one of the essential features 
of simple enumeration that all cases are as of a given time; 
in other words, all cases falling within the limits of time 
specified for the inquiry are reported as if they occurred simul- 
taneously. Successive censuses, on the other hand, are dis- 
tinguished by differences in the time of occurrence of their 
items. Thus, differentiation of time may appear in data drawn 
from enumerations. 

Registration, or continuous enumeration, is always charac- 
terized by the fact that differences in the time of occurrence of 
the individual items are carefully observed and recorded. The 
extent to which the details of time are to be noted in registration 
is a matter to be determined in advance. Explicit rules must be 
set up. Thus, if an agent of the Department of Commerce is to 
report the arrival of ships at a given port of entry, he must be 
instructed whether to report the ships merely as of the day 
of their arrival, or as of the hour, or as of the nearest half- 
hour, or as of some smaller interval of time. Presumably no 
greater labor in reporting these details should be required than 
is necessary. On the other hand, sufficient detail should be 
required, since once the returns are filed, there is ordinarily 
no opportunity to increase thereafter the accuracy of the orig- 
inal reports. 
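
The alternative rules for reporting the time of an arrival can be sketched as follows. The arrival time is hypothetical, and the reading of "as of the hour" as truncation to the hour begun, together with the upward rounding of exact quarter-hours, is an assumption made for illustration.

```python
from datetime import datetime, timedelta


def as_of_day(t):
    """Report only the calendar day of arrival."""
    return t.date()


def as_of_hour(t):
    """Report the hour within which the arrival occurred (assumed: truncation)."""
    return t.replace(minute=0, second=0, microsecond=0)


def nearest_half_hour(t):
    """Report the nearest half-hour; exact quarter-hours are rounded upward."""
    start = t.replace(minute=0, second=0, microsecond=0)
    elapsed = (t - start).total_seconds()
    halves = int((elapsed + 900) // 1800)   # 900 seconds is a quarter-hour
    return start + timedelta(minutes=30 * halves)


arrival = datetime(1923, 5, 14, 9, 47)      # hypothetical arrival
print(as_of_day(arrival), as_of_hour(arrival), nearest_half_hour(arrival))
```

Each coarser rule discards detail that cannot afterwards be recovered from the filed return, which is why the rule must be settled in advance.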

Differentiation of cases or material within the observational 
limits is not a necessary feature of original observation. It is, 
however, a common one. Moreover, differences of attribute, 
location, or time of occurrence, noted in the original observations, 
enlarge greatly the opportunities for subsequent analysis. They 
are, in fact, the source from which the variables spring. 


C. COMPLETE vs. PARTIAL OBSERVATION 


The discussion thus far has assumed that the objective of 
original observation is a complete record of all cases falling within 
the indicated universe of inquiry. This very commonly is the 
purpose of the count. In this case enumeration, or registration, 
is said to be complete. 

The extent to which the purpose of complete observation is 
actually achieved varies, of course, from one inquiry to another. 
In some cases — for example, in accounting records — perfect 
completeness may be taken for granted. In other cases, the 
count may fall somewhat short of completeness though an 
attempt may have been made to include every case. When it is 
to be presumed that the count is not perfectly complete, some 
statement should be made concerning the extent to which duplica- 
tions or omissions are likely to have affected the result. 

However, a complete count of the cases included within the ob- 
servational limits is not always intended. Instead, the aim may 
be to get a sample of the total number of cases. In this event a 
representative — not a complete — set of items constitutes the 
objective of observation. The data are said to be had by sam- 
pling the universe of inquiry. 

When sampling is undertaken, a number of fundamental prob- 
lems are raised, all of which center about the issue of obtaining 
a sample which is truly representative of the larger universe of 
cases from which the sample is drawn. The prerequisites of 
successful sampling are randomness in the process, and adequacy 
in the size of the sample. Randomness implies that the observa- 
tion of individuals shall not be subject to any bias whatsoever, 
but shall be determined on principles of pure chance. Adequacy 
in the size of the sample implies that the number of cases included 
shall be large enough to allow the principles of probability to 
operate dependably.1 In respect to both randomness and 
adequacy, sampling commonly presents serious problems requir- 
ing marked statistical insight and is satisfactorily dealt with 
only after considerable experience in statistical work. 


1 This subject is considered further in Chapter XXIV. 

When properly safeguarded, sampling is one of the most 
important labor-saving practices in statistical inquiry. In 
many respects, it is subject to the same considerations as com- 
plete enumeration; in fact, most of the rules governing complete 
enumeration are equally applicable to observation under the 
conditions of sampling. 
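
The requirement of randomness may be illustrated by a minimal sketch in which every case of a numbered universe has the same chance of selection. The universe of 10,000 case numbers and the sample size of 500 are hypothetical, and the size is chosen arbitrarily; the sketch does not settle the question of adequacy.

```python
import random


def draw_sample(universe, size, seed=None):
    """Draw `size` distinct cases by pure chance, without replacement."""
    rng = random.Random(seed)
    return rng.sample(list(universe), size)


universe = range(1, 10001)                  # hypothetical case numbers 1 to 10,000
sample = draw_sample(universe, 500, seed=1)
print(len(sample), len(set(sample)))
```

Fixing the seed makes the drawing repeatable for purposes of checking; the choice of cases is still left to chance and not to the judgment of the observer.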


D. ACCURACY 


Almost all phases of original observation for statistical purposes 
raise questions regarding the final accuracy of the returns. 
Several of these questions are explicit in the specification of the 
count. This is true, for example, of the degree of accuracy to be 
sought in the actual measurement of the variable attribute. The 
completeness of the count is determined in part by the extent to 
which efforts shall be made to obtain a record of the very last 
case. Aside from these more obvious elements of accuracy, 
however, a number of other influences may give rise to error in 
the results. 

Simple ignorance on the part of the observer or informant may 
lead to inaccuracy. Not infrequently questions which lie beyond 
the knowledge of those to whom the questions are put appear 
on statistical questionnaires. Ordinarily the result is no record 
at all; sometimes data are submitted despite the fact that no 
dependable information can be supplied. 

More commonly, carelessness, or at least an unwillingness to 
make the necessary effort, is the occasion of error. Careful 
statistical recording implies a considerable measure of cooperation 
on the part of the informant or observer. Sometimes such co- 
operation may be compelled through authority; at other times 
it is given from conscientiousness or the sense of loyalty. Not 
infrequently, however, the spirit of cooperation is deficient or 
entirely lacking. In this case, inaccurate or inadequate returns 
result. 

Willful falsification is another cause of inaccuracy. Thus, 
falsification may spring from some definite bias or from an 
assumed self-interest. Age returns commonly understate age 
at certain periods of life because the parties unquestionably 
wish to understate. With extremely old people, willful falsifica- 
tion tends to run in the opposite direction, that is, toward over- 
statement. Inquiries as to whether individuals are employers 
or employes invariably lead to an excessive report of employers 
because many of the informants take pleasure in reporting them- 
selves as employers. Inquiries regarding services in the army 
almost invariably lead to overstatement because individuals 
believe that a statement of services may subsequently be valuable 
in presenting claims for relief or pensions. Great care must be 
exercised to avoid inaccuracy arising from willful misrepresenta- 
tion on the part of the informants. 

Technical difficulties may give rise to inaccuracy in the records. 
No degree of intelligence or willingness can provide answers to the 
questions which are sometimes asked in statistical inquiries. 
An attempt to obtain a statement of the reasons for certain 
expenditures of income may fail miserably because of the impossi- 
bility of saying exactly why these expenditures have been made. 
Motives are elusive objects of statistical inquiry. Similarly, 
there are grave technical difficulties in the grading of intelligence 
in connection with mental alertness tests. The practical limits 
on accurate information must be carefully recognized when the 
scope of statistical inquiry is determined. 

All these considerations have an important bearing upon the 
accuracy of statistical data. Essential accuracy may fall far 
short of nominal accuracy through the operation of such factors 
as ignorance, carelessness, and bias. So far as the action of these 
factors is ascertainable, it must be carefully measured and 
accounted for in the subsequent interpretation of results. The 
problem of accuracy is fundamental, of course, to all phases of 
statistical method. 


The importance of the conditions governing original observa- 
tion should now be apparent. These conditions affect materially 
the technique of subsequent analysis of the data. Problems of 
observational limits, differentiation, completeness, accuracy, 
must all be carefully considered. A full understanding of the 
conditions involved in original observation is essential in every 
statistical investigation.1 

1 See Appendix A for a brief consideration of some of the more technical phases of 
original observation. The problems of utilizing data from secondary sources are discussed 
in Appendix B. 


CHAPTER IV 


CLASSIFICATION AND STATISTICAL SERIES 


Original observations are merely the raw stuff of statistical 
analysis. They require systematic treatment before they begin 
to yield significant results. Individual cases have to be appro- 
priately grouped. Related items have to be arranged in orderly 
fashion. Characteristics of the variable have to be ascertained. 
The relationships of one variable to another have to be brought 
to light. In short, statistical analysis has to be undertaken along 
a number of distinct lines. 


A. CLASSIFICATION 


One of the most important treatments of original data is 
classification.1 The general nature of classification is known to 
all; the individuals of a larger group are divided into smaller 
groups, or classes, on the basis of a specified criterion or charac- 
teristic feature.2 Thus we have classifications of the population 
by sex; of workers, by occupation; of ships, by motive power; 
of automobiles, by style of body. Familiar illustrations might 
be multiplied indefinitely. Classifications are, in fact, one of the 
most common forms of everyday knowledge. 

1 It is to be noted that in one sense the original data themselves are a product of classi- 
fication in that they result from a process of inclusion and exclusion, that is, grouping, 
of individual cases. 

2 If but two divisions are set up, classification is said to be dichotomous; if more than two, 
manifold. 

By very nature, classifications are statistical in character. No 
classification is possible save where there are numerous individuals 
belonging to some larger group, or “population,” or “universe.” 
The enumeration of this total group is itself a statistical task of 
primary importance; but, in classification, this has to be followed 
by a further count, a count of the cases in each one of the 
subdivisions set up in the classification. Clearly, then, classifica- 
tion falls within the general scope of statistical method.1 

Though classification is undoubtedly a primary phase of statis- 
tical inquiry, the various forms of classification are not all of 
equal significance in statistical analysis. Some classifications 
offer much larger opportunities for comprehensive and signifi- 
cant treatment than do others. An examination of the various 
types of classification will serve as an excellent introduction to 
other phases of statistical analysis. 

There are, at bottom, three totally different bases upon which 
the individuals of a larger group may be classified. There may 
be classification on the basis of (1) distinctive characteristics 
or attributes; (2) geographic location; (3) time of occurrence. 
Every logical classification falls into one of these three funda- 
mental forms. 

The three forms are easily illustrated. Take, for example, 
the classification of the immigrant aliens admitted to the United 
States during the fiscal year ended June 30, 1922. The total 
number of these was 309,556.2 This total may be subdivided 
as shown in the tables appearing on pages 38 and 39. 

The three classifications are marked by certain fundamental 
differences. The groupings of Table 1 A relate to attributes in 
which racial distinctions find expression. The classes of Table 
1 B are entirely geographic in character; they make no reference 
to any distinctive attributes of the individual immigrants. The 
groupings of Table 1 C relate entirely to time, no mention being 
made of differences in either attribute or location. The three 
forms can hardly be mistaken. They may be conveniently re- 
ferred to as attributive, geographic, and temporal. 


1 Classification is ordinarily an early phase of statistical inquiry. Frequently a simple 
classification is undertaken at the time of original observation; thus, in almost every count 
of population, a subdivision by sexes is made. Not infrequently, more elaborate groupings 
are recorded in connection with enumeration or registration. On the other hand, as will 
presently be seen, certain types of classification can be satisfactorily undertaken only after 
original observation has been completed. 

2 For original data in this connection see Report of Commissioner of Immigration, 1922, 
pp. 24, 25, 32, 94, 97. 


TABLE 1. IMMIGRANT ALIENS ADMITTED TO THE UNITED STATES 

Fiscal Year Ended June 30, 1922 

A. By Race 

RACE                                          NUMBER ADMITTED 
All races                                             309,556 
African (black)                                         5,248 
Armenian                                                2,249 
Bohemian and Moravian (Czech)                           3,086 
Bulgarian, Serbian, and Montenegrin                     1,370 
Chinese                                                 4,465 
Croatian and Slovenian                                  3,783 
Cuban                                                     698 
Dalmatian, Bosnian, and Herzegovinian                     307 
Dutch and Flemish                                       3,749 
East Indian                                               223 
English                                                30,429 
Finnish                                                 2,500 
French                                                 13,617 
German                                                 31,218 
Greek                                                   3,821 
Hebrew 
Irish 
Italian (north) 
Italian (south) 
Japanese 
Korean 
Lithuanian 
Magyar 
Mexican 
Pacific Islander 
Polish 
Portuguese 
Rumanian 
Russian 
Ruthenian (Russniak) 
Scandinavian (Norwegians, Danes, Swedes) 
Scotch 
Slovak 
Spanish 
Spanish American 
Syrian 
Turkish 
Welsh 
West Indian (except Cuban) 
Other peoples 

TABLE 1. IMMIGRANT ALIENS ADMITTED TO THE UNITED STATES (Continued) 

Fiscal Year Ended June 30, 1922 

B. By Port of Entry 

PORT OF ENTRY                      NUMBER ADMITTED 
All ports                                  309,556 
Boston, Mass.                                4,924 
New Bedford, Mass.                             527 
Providence, R. I.                            2,010 
New York, N. Y.                            200,778 
Philadelphia, Pa.                            3,257 
Norfolk, Va.                                   531 
Miami, Fla.                                    996 
Key West, Fla. 
Tampa, Fla. 
New Orleans, La. 
Other Atlantic and Gulf Ports 
Via Mexico 
San Francisco, Calif. 
Portland, Ore. 
Seattle, Wash. 
Via Canada 
Alaska 
Honolulu, Hawaii 
Porto Rico 


C. By Month of Entry 

MONTH OF ENTRY           NUMBER ADMITTED 
All twelve months                309,556 
July, 1921                        35,504 
August                            37,902 
September                         36,217 
October                           33,201 
November                          34,488 
December                          22,689 
January, 1922                     15,928 
February                          10,792 
March                             14,803 
April                             18,967 
May                               24,169 
June                              24,776 

From a statistical point of view, the first of these three forms 
of classifications — the attributive — may be profitably sub- 
divided. In the first place, a broad distinction may be recog- 
nized between attributive classifications involving differences of 
kind, and those involving differences of degree. When im- 
migrants are classified by race, as in Table 1A, the groupings are 
of kind. The same is true of a sex classification.1 When, on the 
other hand, immigrants are classified according to age—say, 
into the young, middle-aged, and old — the differences are of de- 
gree. The contrast between these two varieties of classification 
is, as will be shown presently, of great statistical significance. 

In the second place, classifications by degree may be further 
subdivided. Some of these classifications are qualitative in 
character, some quantitative. An age classification into the 
three groups noted above — the young, middle-aged, and old — 
is qualitative; one into the three groups, “under 16 years, 16 to 
44 years, 45 years and over,” is quantitative.2 The difference 
between these has an important bearing upon subsequent statis- 
tical analysis. 

1 The sex classification of the same 309,556 aliens is as follows: 

TABLE 1 D. IMMIGRANT ALIENS ADMITTED TO THE UNITED STATES 
Fiscal Year Ended June 30, 1922 
By Sex 

SEX                 NUMBER ADMITTED 
Both sexes                  309,556 
Male                        149,741 
Female                      159,815 

2 A simple age classification of the aliens admitted during the fiscal year 1922 is shown 
in the following table: 

TABLE 1 E. IMMIGRANT ALIENS ADMITTED TO THE UNITED STATES 
Fiscal Year Ended June 30, 1922 
By Age 

AGE                      NUMBER ADMITTED 
All ages                         309,556 
Under 16 years                    63,710 
16 to 44 years                   210,164 
45 years and over                 35,682 
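
The quantitative age classification described above amounts to a definite rule by which any recorded age is assigned to exactly one class. The following sketch states that rule; the individual ages listed are hypothetical, and ages are assumed to be recorded in completed years.

```python
from collections import Counter


def age_class(age):
    """Assign an age in completed years to one of the three quantitative classes."""
    if age < 16:
        return "under 16 years"
    if age <= 44:
        return "16 to 44 years"
    return "45 years and over"


ages = [3, 15, 16, 29, 44, 45, 71]          # hypothetical recorded ages
counts = Counter(age_class(a) for a in ages)
print(counts)
```

No such rule can be written for the qualitative classes "young, middle-aged, and old" until their limits are stated in years, which is precisely the difference between the two forms.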

In the third place, attributive classifications in general may be 
subdivided into simple and multiple forms. In the simple, the 
individuals of the total group are classified with reference to a 
single feature or attribute; in the multiple, with reference at the 
same time to two or more characteristics. Thus, the men of a 
college class may be classified by age as follows: 


TABLE 2 A. SIMPLE CLASSIFICATION OF MEMBERS OF COLLEGE CLASS 
By Age 

AGE (nearest year)    NUMBER OF MEN 
15                                1 
16                                6 
17                               19 
18                               30 
19                               14 
20                                2 
21                                2 
22                                1 
All ages                         75 


But another possibility is a classification of the same men by age 
and, say, height at the same time as follows: 


TABLE 2 B. MULTIPLE CLASSIFICATION OF MEMBERS OF COLLEGE CLASS 
By Age and Height 

AGE (nearest year) | HEIGHT (nearest inch) 
                   | 62-63 64-65 66-67 68-69 70-71 72-73 74-75 | All Heights 
15 1 1 
16 2 1 1 2 6 
17 2 3 6 5 2 1 19 
18 1 1 5 11 8 2 2 30 
19 2 4 4 3 1 14 
20 1 1 2 
21 1 2 
22 1 1 
All Ages 4 A || seas az aI Io 4 75 


Each individual case is here classified on the basis of two dif- 
ferent criteria. Some of the most significant statistical analyses 
deal with such multiple classifications. 
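
A multiple classification of this kind may be sketched as a count of cases in each combination of age and height class. The pairs of age and height below are hypothetical and are not the data of Table 2 B; the two-inch classes beginning at an even inch follow the class limits shown in that table.

```python
from collections import Counter


def height_class(inches):
    """Two-inch height class beginning at an even inch, e.g. 68 and 69 fall in '68-69'."""
    low = inches - inches % 2
    return f"{low}-{low + 1}"


# Hypothetical (age in years, height in inches) pairs.
men = [(18, 68), (18, 69), (18, 70), (17, 66), (19, 68), (17, 67)]

cells = Counter((age, height_class(h)) for age, h in men)   # body of the table
by_age = Counter(age for age, _ in men)                     # row totals
by_height = Counter(height_class(h) for _, h in men)        # column totals
print(cells[(18, "68-69")], by_age[18], by_height["68-69"])
```

The row and column totals are simple classifications of the same men; only the cells show how the two attributes are combined in the individual cases.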

Recognition of the different forms of classification is of great 
importance in statistical analysis. Commonly, the entire tech- 
nique of analysis is affected by the character of the differences 
recognized in the original grouping of cases. Classification is, 
therefore, not only one of the most familiar, but one of the most 
significant phases of statistical work. 


B. STATISTICAL SERIES 


Another fundamental statistical process is seriation — the 
process by which unorganized data are cast into the form of a 
statistical series. An idea of the importance of this phase of 
method is most easily obtained from an examination of the 
structure of statistical series. 

A statistical series is an orderly succession of items relating 
differences in one variable to differences in another. Technically 
we may say that a statistical series expresses one variable as 
a function of another.1 The variable, the values of which are 
made the basis upon which to express differences in the other, 
is referred to as the independent variable; the other, as the 
dependent. The values of the independent variable are more or 
less arbitrarily set up in forming the series. The independent 
variable is, thus, the controlling factor in analysis. The depend- 
ent takes its values from the independent. Independent and 
dependent variables together give the series its character. 

A concrete example will help to make these notions clear. 
Consider the data shown in Table 3. The figures state in 
clear and simple form the course of the farm price of cotton 
in the United States over a period of fifteen years. A definite 
relationship is set forth: the relationship between time (at 
yearly intervals) and the price of a pound of cotton (in cents). 


1 A series is commonly given algebraic expression by letting such symbols as x, y, z denote 
associated variables. Different algebraic equations will then represent the relationship 
between the variables. Thus, y = 2x + 4 expresses a functional relationship between 
x and y. The expression therefore may stand for a statistical series. 

TABLE 3. FARM PRICE OF RAW COTTON IN THE UNITED STATES, 1910-1923 
By Years 

YEAR (on December 1)    PRICE (cents per pound) 
1910                                       14.1 
1911                                        8.8 
1912                                       11.9 
1913 
1914                                        6.8 
1915                                       11.3 
1916                                       19.6 
1917                                       27.7 
1918                                       27.6 
1919                                       35.6 
1920                                       13.9 
1921                                       16.2 
1922                                       23.8 
1923                                       31.0 


Time is the independent variable; price, the dependent. The 
arrangement of the several items is orderly and serves to throw 
into bold relief the nature of the changes in price over the 
period in question. Data in such a form constitute a statis- 
tical series. 
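
The notion of a series as one variable expressed in terms of another may be sketched with a few entries from Table 3. Only three consecutive years are used, and the function name is introduced here for illustration; time serves as the independent variable and price as the dependent.

```python
# Three consecutive entries as read from Table 3 (cents per pound on December 1).
price = {1914: 6.8, 1915: 11.3, 1916: 19.6}


def yearly_changes(series):
    """Change in the dependent variable between successive values of the independent one."""
    years = sorted(series)
    return {
        later: round(series[later] - series[earlier], 1)
        for earlier, later in zip(years, years[1:])
    }


changes = yearly_changes(price)
print(changes)
```

The orderly arrangement by year is what makes such differences meaningful; the same prices in arbitrary order would show nothing of the course of change.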

Statistical series assume widely different forms. Thus, a 
series may express the relationship between weight and age in 
children; the relationship between the number of bank failures 
and general business conditions; the relationship between the 
price of eggs and the successive seasons of the year. In general, 
four fundamental types of series are to be recognized. These 
may be designated the (1) simple attributive, (2) multiple at- 
tributive, or correlative, (3) geographic or spatial, and (4) tem- 
poral. Concrete illustrations will help to make clear the essen- 
tial differences among the four. 

The data of Table 4 constitute a simple attributive series, 
commonly called a frequency distribution.1 


1 The term “frequency distribution” does not in fact describe this form any more truly 
than it does certain others (see Chapters VII and VIII). 

TABLE 4. CLASSIFICATION OF 700 NEWSBOYS OF CINCINNATI IN 1918

By Age

AGE              NUMBER OF NEWSBOYS
(in years)

All ages              700

8                       8
9                      14
10                     76
11                    150
12                    154
13                    140
14                     84
15                     48
16                     12
17                     14

Source: Adapted from data given by M. B. Hexter in “The Newsboys of Cincinnati,”
Studies of the Helen S. Trounstine Foundation (1919), vol. I, no. 4.

The series expresses the variable frequency of occurrence (number
of cases) of the different sizes or magnitudes of the variable 
attribute, age. No distinctions of time or location are involved. 
The form is very frequently encountered in statistical analysis. 
It will be considered at length in an early chapter. 
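
A minimal sketch in Python may make the form concrete. It holds the series of Table 4 as a mapping from age to number of cases. The variable names are illustrative assumptions, and the counts for ages 16 and 17 are read as 12 and 14, which is the reading the stated total of 700 requires.

```python
# Table 4 as a simple attributive series: age (years) -> number of newsboys.
newsboys_by_age = {8: 8, 9: 14, 10: 76, 11: 150, 12: 154,
                   13: 140, 14: 84, 15: 48, 16: 12, 17: 14}

# The class frequencies together account for the whole group.
total = sum(newsboys_by_age.values())

# The age occurring with the greatest frequency.
most_common_age = max(newsboys_by_age, key=newsboys_by_age.get)

print(total)            # the "All ages" figure
print(most_common_age)
```

The mapping carries no reference to time or place; each entry pairs one magnitude of the attribute with its frequency.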

In Table 5, a multiple attributive, or correlative, series is shown.


TABLE 5. AVERAGE WEEKLY EARNINGS OF 700 NEWSBOYS OF CINCINNATI IN 1918

By Age

AGE              AVERAGE WEEKLY EARNINGS
(in years)

8                     $ .31
9                       .57
10                      .68
11                      .87
12                      .95
13                   [illegible]
14                     1.62
15                     2.14
16                     3.10
17                     3.68


Source: J 
The data in this case show the connection between two variable
attributes. As age increases, earnings increase. There is no 
statement here of variable frequencies, nor of differences of loca- 
tion or time. In series of this sort, the data state the quantita- 
tive relationship of two variable traits, or possessions, or experi- 
ences, observed in a group of related individuals. 
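
The statement that earnings increase with age can be checked directly against the paired figures. The sketch below, in Python, is offered only as an illustration; the names are assumptions, and age 13 is left out because its entry in Table 5 cannot be read in this copy.

```python
# Table 5 as paired observations: age (years) -> average weekly earnings (dollars).
# Age 13 is omitted because its figure is illegible in this copy.
earnings_by_age = {8: 0.31, 9: 0.57, 10: 0.68, 11: 0.87, 12: 0.95,
                   14: 1.62, 15: 2.14, 16: 3.10, 17: 3.68}

ages = sorted(earnings_by_age)

# True when each listed age earns more than the listed age before it.
rises_with_age = all(
    earnings_by_age[younger] < earnings_by_age[older]
    for younger, older in zip(ages, ages[1:])
)

print(rises_with_age)
```

Both members of each pair are attributes of the same group of individuals, which is what marks the series as correlative rather than as a statement of frequencies.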

The series given in Table 6 is of an entirely different sort. 


TABLE 6. PERCENTAGE OF FOREIGN BORN IN POPULATION OF NEW ENGLAND,
JANUARY 1, 1920

By States

STATE                    PER CENT

Maine                      14.0
New Hampshire              20.6
Vermont                    12.6
Massachusetts              28.3
Rhode Island               29.0
Connecticut                27.4

Source: Fourteenth Census of the United States, 1920, vol. II, p. 33.


The distinctions made in this case have definite geographic signifi-
cance. It is possible to think of the variable percentage of foreign
born in terms of variable location: north and south, east and west.
In other words, the dependent variable is spread out on the
map. The series is essentially geographic, or spatial, in character.

The fourth type of series is illustrated in Table 7.


TABLE 7. COTTON CROP OF THE UNITED STATES (EXCLUSIVE OF LINTERS)

By Crop Years, 1917-1922

CROP YEAR            RUNNING BALES

1917                  11,342,780
1918                  11,983,582
1919                  11,382,084
1920                  13,374,237
1921                   8,039,673
1922                   9,815,397

Source: Dept. of Commerce, Bureau of the Census, “Cotton Production and Dis-
tribution,” Bulletin 153, p. 10.

The data here draw no distinctions of kind, nor of place; all
cotton in the entire United States is included. The differences 
noted in the several crops relate altogether to different periods of 
time. Where the other types of series are static, this type is 
dynamic. Series relating to time are very important in economic 
analysis and will later be examined at length. They are com- 
monly called time series. 

The process of forming series of these different types lies at the 
very center of statistical work. Analysis in general undertakes 
to discover and interpret the relationships existing among vari- 
ables. Such relationships are best stated quantitatively in the 
form of statistical series. Not infrequently analysis has both 
its beginning and its end in a series — its beginning in a series 
obtained from simple tabulation, its end in a series derived 
from elaborate curve-fitting or correlation measurements. A 
thorough understanding of the nature and forms of the series is 
thus indispensable in statistical analysis. In fact, the distinc- 
tions among the several forms of series have so important a bear- 
ing upon statistical procedure that they may well be made the 
basis upon which to examine the various phases of statistical
technique.

Between seriation on the one hand and classification on the 
other there exists a close connection. On the one hand, classifi- 
cation may result directly in the formation of a statistical series. 
It does so whenever the basis of classification consists of the 
gradations of a variable. Thus, if the vessels of the Merchant
Marine are classified by speed (say, into the classes of those
making under 9 knots, 9 to 14½ knots, 15 to 17½ knots, 18 knots
and over), a series is obtained. If, however, the basis of
classification is not in any way quantitative, no series can result 
from a grouping of the cases. A classification of the vessels of 
the Merchant Marine by nature of service, say into (a) freight,
(b) freight and passenger, (c) freight and refrigerator, (d) freight,
passenger, and refrigerator, and (e) tanker, is not a statistical
series, for the data do not express a relationship between two
variables.
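
The first case, classification on the gradations of a variable, can be sketched in Python as follows. The class limits are read as under 9, 9 to 14½, 15 to 17½, and 18 knots and over; the vessel speeds are hypothetical figures invented for the illustration, and the function name is an assumption.

```python
from bisect import bisect_right
from collections import Counter

# Lower limits of the second, third and fourth speed classes named in the text.
LOWER_LIMITS = [9, 15, 18]
LABELS = ["under 9 knots", "9 to 14 1/2 knots",
          "15 to 17 1/2 knots", "18 knots and over"]

def speed_class(knots):
    """Return the label of the speed class into which a vessel falls."""
    return LABELS[bisect_right(LOWER_LIMITS, knots)]

# Hypothetical speeds in knots, invented purely for illustration.
speeds = [8.5, 9, 10.5, 12, 14.5, 15, 16, 17.5, 18, 21]

counts = Counter(speed_class(s) for s in speeds)

# Listing the classes in order of speed yields a series:
# speed class -> number of vessels.
series = [(label, counts[label]) for label in LABELS]
print(series)
```

Because the classes are themselves ordered by magnitude, the counts form a series. A grouping by nature of service would give counts as well, but its classes have no such order.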

Seriation, likewise, may or may not involve classification. It
does not involve classification if individual observations are not 
grouped but merely placed in order. If the price of wheat is 
recorded on the first day of each month, and the items thus 
secured are set in their chronological order, a time series is 
obtained without classification: the individual items have been 
merely arranged, not grouped. Similarly, a spatial series may be 
obtained without classification if there is no grouping of the 
individual items. Seriation does not necessarily entail classifica- 
tion, any more than classification necessarily entails seriation. 
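
Seriation without classification amounts to ordering alone. The Python sketch below illustrates this with wheat prices that are wholly hypothetical; the dates, prices, and names are assumptions made for the example.

```python
from datetime import date

# Hypothetical wheat prices (dollars per bushel) recorded on the first of each
# month, listed in the haphazard order in which they might have been collected.
observations = [
    (date(1923, 3, 1), 1.19),
    (date(1923, 1, 1), 1.21),
    (date(1923, 4, 1), 1.24),
    (date(1923, 2, 1), 1.18),
]

# Seriation only: every item is kept as it stands and merely placed in
# chronological order; nothing is grouped.
time_series = sorted(observations, key=lambda item: item[0])

for day, price in time_series:
    print(day.isoformat(), price)
```

The number of items is unchanged by the operation, which is the mark of arrangement as opposed to grouping.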

Nevertheless, the formation of statistical series does commonly 
require a classification of individual items. Classification is, 
in fact, a process of very general significance. If a large mass of 
unorganized individual observations is in hand, the first task of 
all is ordinarily to classify these items on some significant basis. 
Classification almost always either accompanies or immediately 
follows the collection of original items. It is an essential part of 
statistical tabulation; an important phase of analysis. A care- 
ful examination of the results of statistical classification may 
well constitute the first stage of the present study. 


FORMULATION OF STATISTICAL 
DISTRIBUTIONS 


CHAPTER V 


CLASSIFICATIONS NOT IN SERIAL FORM 


As noted in the preceding chapter, classification and seriation 
are statistical processes which run sometimes together, and some- 
times apart; in other words, the results of classification at times 
take the form of statistical series and at other times do not. 
Invariably, however, classification serves to break up the larger 
aggregate into smaller groups. Through classification the total 
mass is distributed. Classifications may be appropriately re- 
ferred to, therefore, as statistical distributions. When these 
distributions are in the form of series, they may be called dis- 
tributive series. Series of this sort are dealt with in the chapters 
that follow. The purpose of the present chapter is to consider 
classifications which are not in serial form. These are, in general, 
of the attributive type. They may be best examined under the 
two headings: (1) classifications by kind; (2) classifications of
degree (qualitative).¹


A. CLASSIFICATIONS BY KIND 


Classifications by kind are probably the most common type of 
all. In classifications of this type the related objects of the total 
group are subdivided with reference to certain combinations of 
attributes which serve to distinguish or characterize the individ- 
ual members of the group. The several classes do not represent 
mere differences of size or intensity or degree in some single 
trait or attribute. On the contrary they relate to differences in
combinations of attributes which are commonly recognized as
constituting differences “of kind.” The four tables given below
serve to illustrate the type:

1 Larger masses are sometimes divided into components which can hardly be regarded
as parts under a logical classification; e.g. capitalization of constituent units in an industrial
combination. The statistical treatment of such distributions does not differ substantially
from the treatment of classifications by kind. “Component parts,” therefore, will not
be separately discussed.


TABLE 8. AVERAGE NUMBER OF WAGE EARNERS IN MAJOR INDUSTRIAL
GROUPS OF MANUFACTURING ESTABLISHMENTS IN THE UNITED STATES
IN 1919

INDUSTRIAL GROUP                                      AVERAGE NUMBER OF
                                                      WAGE EARNERS*

All industries                                            9,096,372

Food and kindred products                                   684,672
Textiles and their products                               1,611,300
Iron and steel and their products                         1,585,712
Lumber and its remanufactures                               839,008
Leather and its finished products                           349,362
Paper and printing                                          509,875
Liquors and beverages                                        55,442
Chemicals and allied products                               427,008
Stone, clay, and glass products                             298,059
Metals and metal products other than iron and steel         330,409
Tobacco manufactures                                        157,097
Vehicles for land transportation                            495,939
Railroad repair shops                                       515,709
Miscellaneous industries                                  1,227,[illegible]

Source: Statistical Abstract, 1922, p. 199.

* Numbers given are averages of the numbers reported on the fifteenth of each month
of the census year.


TABLE 9. CAUSES OF BUSINESS FAILURES IN THE UNITED STATES IN 1922

CAUSE                        NUMBER

Total                        20,014

Incompetence                  6,404
Inexperience                  1,142
Lack of capital               5,855
Unwise credits                  230
Failures of others              226
Extravagance                     82
Neglect                         257
Competition                     183
Specific conditions           4,638
Speculation                      66
Fraud                           931

Source: Bradstreet’s, February 4, 1922, vol. 50, p. 86.

TABLE 10. AMOUNTS OF DIFFERENT KINDS OF MONEY IN THE UNITED STATES
IN BANKS OTHER THAN FEDERAL RESERVE BANKS, AND IN CIRCULATION,
JUNE 30, 1921

KIND                                             AMOUNT

Aggregate                                    $5,779,437,473

Gold coin and bullion                        $  883,404,285
Silver dollars                                   75,953,333
Subsidiary silver                               261,650,873
  Total metallic                             $1,220,108,491

United States notes                          $  342,640,537
Federal reserve notes                         2,680,494,274
Federal reserve bank notes                      148,349,552
National bank notes                             729,550,513
  Total notes                                $3,901,043,876

Gold certificates                            $  452,174,709
Silver certificates                             201,534,213
Treasury notes of 1890                            1,576,184
  Total certificates and Treasury notes      $  655,285,106

Source: Treasury Dept., Finance Report, 1921, p. 547.


TABLE 11. GENERAL DEPARTMENTAL EXPENSES OF STATE GOVERNMENTS
IN THE UNITED STATES IN 1921

DEPARTMENT                                       AMOUNT

Total                                         $918,309,400

General government                            $ 71,288,483
Protection to person and property               52,738,711
Development of natural resources                42,021,912
Health and sanitation                           21,995,742
Highways                                       106,377,190
Charities, hospitals, and corrections          162,468,927
Schools                                        329,863,282
Libraries                                        2,002,778
Recreation                                       1,869,609
Miscellaneous                                  127,682,757

Source: Dept. of Commerce, Bureau of the Census, Financial Statistics of States, 1922, p. 23.

The problems involved in the formulation of distributions of
this type are essentially problems of logic. Certain formal rules 
are to be recognized: the classes taken together should completely
cover the “universe of inquiry”; the classes should be mutually
exclusive; they should involve one, and only one, criterion, or 
basis of division. Besides such rules of formal logic, certain other 
considerations apply. The distinctions recognized should be 
uniform and constant. They should possess real significance. 
There is a strong presumption in favor of classifications which 
have been long enough established to be commonly referred to 
and customarily employed. Effective classification usually pre- 
supposes thorough acquaintance with the material to be ana- 
lyzed but in general presents no problems in statistical tech- 
nique requiring elaboration at this point. 
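
The first two formal rules lend themselves to a mechanical check. The following Python sketch is one possible way of making it; the function name, the list of vessels, and the deliberately faulty scheme are all hypothetical. The third rule, a single basis of division, is a matter of judgment and is not tested here.

```python
def check_classification(items, classes):
    """Report items that fall into no class or into more than one class.

    `classes` maps a class name to a test, a function returning True or False.
    """
    uncovered, overlapping = [], []
    for item in items:
        matches = [name for name, test in classes.items() if test(item)]
        if not matches:
            uncovered.append(item)      # the classes do not cover the universe
        elif len(matches) > 1:
            overlapping.append(item)    # the classes are not mutually exclusive
    return uncovered, overlapping

# Hypothetical universe of inquiry: vessels described by their service.
vessels = ["freight", "tanker", "freight and passenger", "tug"]

# A faulty scheme: "tug" is left out, and a freight-and-passenger vessel
# satisfies two of the tests at once.
faulty = {
    "freight": lambda v: "freight" in v,
    "passenger": lambda v: "passenger" in v,
    "tanker": lambda v: v == "tanker",
}

print(check_classification(vessels, faulty))
```

A sound classification returns two empty lists: nothing uncovered and nothing counted twice.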

Statistical treatment of classifications by kind commonly 
follows three lines: (1) conversion of the absolute figures to per- 
centages of the total; (2) rearrangement of the items in the order 
of magnitude; (3) graphic representation. These three lines 
may be examined in turn. 

Accurate interpretation of a classification by kind is often 
facilitated by reduction of the absolute figures to percentages. 
Of course, in some analyses importance attaches to the size 
of the group totals, in which case attention is to be given to the 
absolute figures. As a rule, however, interest runs to the char- 
acter of the distribution rather than to the absolute size of the dif- 
ferent classes. Though the nature of the distribution is shown in 
some measure by the absolute numbers, it is not given as simply 
nor as clearly as by the corresponding percentages. 

The point will be perfectly evident from Table 12. The column 
of percentages given in this table is much more clearly indicative 
of the character of the distribution than the column of absolute 
numbers shown in Table 11. The practice of thus converting 
distributions of absolute numbers to percentage distributions is 
of wide applicability. The conversion is always suggested when 
the character of the distribution is the object of inquiry rather 
than the actual numbers falling within the several recognized 
classes; and the conversion is especially indicated when the
problem is to compare the forms of distribution in several aggre- 
gates of different size. 
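
The conversion itself is simple arithmetic: each class is divided by the total and multiplied by 100. The Python sketch below applies it to the amounts of Table 11. Two assumptions should be noted: the figure for Schools is taken as 329,863,282, a reading chosen to agree with the 35.9 per cent printed in Table 12, and the percentages are computed on the sum of the listed items rather than on the printed total.

```python
# Amounts of Table 11, in dollars. The Schools figure is taken as 329,863,282,
# an assumption chosen to agree with the 35.9 per cent printed in Table 12.
expenses = {
    "General government": 71_288_483,
    "Protection to person and property": 52_738_711,
    "Development of natural resources": 42_021_912,
    "Health and sanitation": 21_995_742,
    "Highways": 106_377_190,
    "Charities, hospitals, and corrections": 162_468_927,
    "Schools": 329_863_282,
    "Libraries": 2_002_778,
    "Recreation": 1_869_609,
    "Miscellaneous": 127_682_757,
}

total = sum(expenses.values())

# Each class as a percentage of the total, to one decimal place.
percentages = {name: round(100 * amount / total, 1)
               for name, amount in expenses.items()}

for name, pct in percentages.items():
    print(f"{name:<40}{pct:>6.1f}")
```

The same few lines serve for any classification by kind, and they put aggregates of different size on a common footing for comparison.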

TABLE 12. PERCENTAGE DISTRIBUTION OF GENERAL DEPARTMENTAL EXPENSES
OF STATE GOVERNMENTS IN THE UNITED STATES IN 1921

EXPENSE                                      PERCENTAGE

Total                                          100.0

General government                               7.8
Protection to person and property                5.7
Development of natural resources                 4.6
Health and sanitation                            2.4
Highways                                        11.6
Charities, hospitals, and corrections           17.7
Schools                                         35.9
Libraries                                         .2
Recreation                                        .2
Miscellaneous                                   13.9


Rearrangement of the classes in the order of magnitude is a 
second step commonly taken in the treatment of classifications 
by kind. The rearrangement may obscure somewhat the logical 
basis of classification, but has the advantage of throwing into 
clear relief the relative importance of the different groups. The 
change in emphasis which is effected by the rearrangement is 
evident upon comparison of Table 13 with Table 9.


TABLE 13. CAUSES OF BUSINESS FAILURES IN THE UNITED STATES IN 1922

CAUSE                        NUMBER

Total                        20,014

Incompetence                  6,404
Lack of capital               5,855
Specific conditions           4,638
Inexperience                  1,142
Fraud                           931
Neglect                         257
Unwise credits                  230
Failures of others              226
Competition                     183
Extravagance                     82
Speculation                      66

Rearrangements of this sort frequently contribute substantially
to a thorough understanding of the character of the distribution. 
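
Rearrangement in the order of magnitude is a plain sort on the class totals. The Python sketch below starts from the figures of Table 9 in their original order; the variable names are illustrative, and the entry for failures of others is read as 226, the figure that the total of 20,014 requires.

```python
# Table 9: causes of business failures, in the order of the original classification.
failures = {
    "Incompetence": 6404, "Inexperience": 1142, "Lack of capital": 5855,
    "Unwise credits": 230, "Failures of others": 226, "Extravagance": 82,
    "Neglect": 257, "Competition": 183, "Specific conditions": 4638,
    "Speculation": 66, "Fraud": 931,
}

# Rearrangement in the order of magnitude, largest class first.
ranked = sorted(failures.items(), key=lambda pair: pair[1], reverse=True)

for cause, number in ranked:
    print(f"{cause:<22}{number:>7,}")
```

The original order is discarded in the process, which is why the logical basis of the classification becomes less apparent in the rearranged table.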

Graphic representation is a third step that may be taken in 
the analysis of classifications by kind. In general, classifications 
of this sort are to be represented by bar diagrams. In these 
diagrams, the width and spacing of the bars are kept uniform; 
the length of the bars is made to correspond to the magnitudes 
of the several classes. Chart 1 serves to illustrate the form. 


CHART 1. PERCENTAGE DISTRIBUTION OF AVERAGE NUMBER OF WAGE EARNERS
IN MAJOR INDUSTRIAL GROUPS OF MANUFACTURING ESTABLISHMENTS IN
THE UNITED STATES IN 1919¹

[Horizontal bar diagram; scale in per cent, marked at 5, 10, and 15. Bars,
from top to bottom: Textiles; Iron and steel; Lumber; Food; Railroad repair
shops; Paper and printing; Vehicles; Chemicals; Leather; Non-ferrous metals;
Stone, clay, and glass; Tobacco; Liquors and beverages; Miscellaneous.]



One of two styles may be followed in constructing the bars of 
diagrams of this sort: they may be placed either horizontally or 
vertically. For classifications by kind, however, there are 
marked advantages in the horizontal bar form. In the first 
place, this form permits of a somewhat more accurate reading 
of the lengths of the bars. In the second place, it permits of a 
more convenient combination of the tabular and graphic exhibits:
the chart and table may be placed alongside one another as in
Chart 1, the figures then appearing in a column in which compari-
sons are facilitated. In the third place, the horizontal form is a
style which is effectively constructed in simple fashion on a
typewriter, as shown in Chart 2, without resort to any drafting
instruments.

1 The groups are designated by their principal products.


PERCENTAGE OF NATIONAL INCOME CONTRIBUTED BY VARIOUS INDUSTRIES
IN THE UNITED STATES
(Average 1909 to 1918)

TABLE 14                                            CHART 2

INDUSTRY                                PER CENT

Unclassified industries and [illegible]   [illegible]
Manufacturing, incl. hand trades          30
Agriculture                               17
Transportation                            [illegible]
Government                                [illegible]
Mineral production                        [illegible]
Banking                                   [illegible]

[Chart 2: each percentage is shown alongside the table as a horizontal bar
typed with the letter X.]

Source: National Bureau of Economic Research, Income in the United States, vol. I,
p. 23.


Finally, the horizontal position for the bars of charts representing 
distributions of this type serves to distinguish these distributions 
from certain others, to be considered later, in the graphic repre- 
sentation of which the vertical bar diagram is already standard 
practice. In general, then, the horizontal bar diagram would 
seem to be the better device for the graphic representation of 
classifications by kind.¹

1 It should be said, however, that the vertical arrangement of bars is clearly indicated
when the diagram is designed to show changes in a classification by kind at different points
in time. This conclusion follows from the fact that in the graphic representation of varia-
tion related to time, standard practice requires the location of the scale of time along the
horizontal axis.

Whether horizontal or vertical bars are employed, the bars are
to be drawn from a common base-line. In the case of horizontal
bars, the base-line is at the left-hand edge of the diagram; in the
case of vertical, it is at the bottom. It is common practice to
draw in behind the bars a few evenly spaced scale-lines so that the
length of the bars can be read approximately from the diagram.
Accompanying figures are best placed in tabular form alongside
of the base-line of the chart rather than inside the bars.¹ Care
should be exercised not to encumber a diagram with detailed
data. The purpose of a chart is to convey a simple and clear
visual image.
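
The typewriter construction and the rules just stated can be imitated in a few lines of Python. The sketch is an illustration only. The function name and the scale of one X per ten million dollars are assumptions, and the figures are those of the orchard fruit crops of Chart 3, with the values for plums and apricots read as 41 and 12.

```python
def typed_bar_chart(rows, unit):
    """Return the lines of a horizontal bar chart built from the letter X.

    `rows` is a list of (label, value) pairs; each X stands for `unit` of value.
    The figures sit in a column beside the labels, and every bar starts at the
    same base-line, marked by the vertical rule.
    """
    width = max(len(label) for label, _ in rows)
    lines = []
    for label, value in rows:
        bar = "X" * round(value / unit)
        lines.append(f"{label:<{width}} {value:>5} | {bar}")
    return lines

# Values of orchard fruit crops in 1919, in millions of dollars.
fruit = [("Apples", 242), ("Peaches", 96), ("Plums", 41),
         ("Pears", 26), ("Cherries", 14), ("Apricots", 12)]

chart = typed_bar_chart(fruit, unit=10)
print("\n".join(chart))
```

Keeping the figures in their own column, outside the bars, leaves the lengths of the bars free to carry the visual comparison.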

Mention may be made of certain other types of diagram some- 
times employed to exhibit classifications by kind. In the figures 
shown thus far, the sizes of the several classes are shown by bars 
placed side by side. An alternative scheme is to place the bars 
end on end. Chart 3 illustrates this possible arrangement. 


CHART 3. VALUES OF ORCHARD FRUIT CROPS IN THE UNITED STATES IN 1919
(in millions of dollars)

1. Apples..... 242    2. Peaches..... 96    3. Plums..... 41
4. Pears...... 26     5. Cherries..... 14   6. Apricots... 12

Source: Yearbook of the Dept. of Agriculture.


This form is by no means as effective as that recommended 
above. It may be employed, nevertheless, when several different 
distributions are to be compared. Even then, however, certain 
alternative styles are to be considered. These are shown in the 
first two of the three figures of Chart 4. 
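
For the end-on-end arrangement of Chart 3, the position of each segment follows from a running total of the class values. The Python sketch below is illustrative; the names are assumptions, and the values for plums and apricots are read as 41 and 12.

```python
from itertools import accumulate

# Values of orchard fruit crops in 1919, in millions of dollars (Chart 3).
fruit = [("Apples", 242), ("Peaches", 96), ("Plums", 41),
         ("Pears", 26), ("Cherries", 14), ("Apricots", 12)]

values = [value for _, value in fruit]
ends = list(accumulate(values))    # far end of each segment along the bar
starts = [0] + ends[:-1]           # each segment begins where the last one stopped

for (name, _), start, end in zip(fruit, starts, ends):
    print(f"{name:<9}{start:>4} to {end:>3}")
```

Only the first segment begins at the base-line. Every other segment must be judged by the distance between two points along the bar, which helps to explain why the form is less effective than bars placed side by side.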

No one of these types of diagram is entirely satisfactory. Form 
A is to be preferred, perhaps, when there are but two distributions 
to be shown and the prime object of the graphic exhibit is to 
afford a comparison of the sizes of the constituent groups of the 
individual classifications. The form is difficult to use, however,
when there are more than two distributions, as the visual impres-
sion is then confused by the multiplicity of bars of varying sig-
nificance. Form B is to be preferred when the prime object is to
show a comparison of the sizes of the several subdivisions recog-
nized in the classification. Each of these subdivisions is directly
compared, but the image of each classification as a whole is
almost entirely lost. Form C brings out the subdivisions shown
by each distribution, but renders comparison of the sizes of the
individual constituents somewhat difficult. The form is perhaps
to be preferred when the distributions have been reduced to
percentage form, but is not satisfactory even then when the
percentages in the different subdivisions show wide variation.
In general, the conclusion has to be drawn that the graphic
comparison of two or more classifications by kind cannot be made
entirely successful.

CHART 4. CONSTRUCTION OF STEAM VESSELS 100 GROSS TONS AND OVER
IN THE UNITED STATES, UNITED KINGDOM, AND OTHER COUNTRIES, IN THE
FISCAL YEARS 1913 AND 1918

1 Figures placed at the uneven ends of the bars actually distort the impression conveyed
by the diagram.


B. CLASSIFICATIONS OF DEGREE (QUALITATIVE) 


Not all attributive classifications are by kind. In some, the 
distinctions are of degree; the objects are classified with reference 
to a single attribute the size or intensity of which varies from 
case to case. 

A concrete illustration of the difference between the two types 
may prove helpful. Retail shops in New York City may be 
classified as to line of business as (1) clothing; (2) drug; (3) gro-
cery, and so on. They may also be classified according to size as
(1) large; (2) medium-sized; (3) small. The first of these dis- 
tributions is by kind; the second, by degree. 

Classifications by degree are to be subdivided into the (1) qual-
itative and (2) quantitative. If the work of the students in a
college course is graded on the common letter system, viz.,
(A) Excellent; (B) Good; (C) Satisfactory; (D) Unsatisfactory;
(E) Unacceptable, and the students are then grouped accord-
ing to their letter grades, the distinctions made are assumed to
be qualitative, and a qualitative distribution is the result. If,
on the other hand, the work is graded on a percentage basis and
the students then grouped in 10-point classes (thus 90-100;
80-89; 70-79; etc.), quantitative distinctions are set up and a
quantitative distribution is obtained.¹
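
The quantitative grouping can be sketched in Python as follows. The marks are hypothetical whole-number percentages invented for the illustration, the function name is an assumption, and the classes are read as 90-100, 80-89, 70-79, and so on downward.

```python
from collections import Counter

def ten_point_class(mark):
    """Return the 10-point class of a whole-number percentage mark.

    The top class runs 90-100, so a mark of 100 is placed in it.
    """
    lower = min(mark // 10 * 10, 90)
    upper = 100 if lower == 90 else lower + 9
    return f"{lower}-{upper}"

# Hypothetical percentage marks, invented for illustration.
marks = [100, 93, 88, 84, 79, 75, 71, 66, 58]

distribution = Counter(ten_point_class(m) for m in marks)
print(distribution)
```

The class limits here are fixed numerically, whereas the letter system rests on the grader's judgment of quality. That difference is the whole distinction between the two distributions.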

1 Quantitative distributions of this sort are commonly called frequency distributions.

Two varieties of qualitative distribution are to be recognized:
(1) those dealing with variables which are quantitatively meas-
urable; (2) those dealing with variables which are not yet
quantitatively measurable, though they may later become so.
An age classification subdividing the population into the
(a) young, (b) middle-aged, and (c) old might upon further
investigation and more detailed information be translated into a
classification of the same population by ten-year age groups as
follows:

Under 10 years
10 and under 20
20 and under 30
etc.

The translation from qualitative to quantitative form here 
clearly involves little more than greater care in the original 
observations, with subsequent tabulation on a strictly quantita-
tive basis.¹ In contrast, a classification of college students,
according to their seriousness of purpose, offers no apparent 
opportunity for quantitative distinctions. The classification 
is for the present necessarily qualitative. 

Some distributions, outwardly qualitative, prove upon inves- 
tigation to be essentially quantitative. Thus, if, in classifying 
individuals into young, middle-aged, and old, the actual dis- 
tinction between the first two groups has been drawn uniformly 
at 30 years and between the second and third groups uniformly 
at 60, the classification is really a simple quantitative one. The 
same might be true of the letter grading shown in the first illus- 
tration above, if, in arriving at the letter grades, the instructors 
have rated their students on a percentage basis and have sub- 
sequently converted the percentages into letters on some uniform 
scheme. One must not be misled by the form in which the classes 
of the distribution are indicated. It may be that qualitative 
designations are readily translatable into quantitative. 

1 Of course, no classification already made can be converted in the manner indicated
unless details are available for each individual. Commonly no such details are at hand;
classification of the individuals included within the inquiry has been made by the original
informants, or observers. The enumerators in a population census may report the age of
each inhabitant on the basis of a rough classification provided in the census schedules.
Completion of the age distribution of the population then involves merely a count of the
individuals reported in the several subdivisions of the classification. Under these circum-
stances no later refinements of classification are possible without a new set of observations.

Two varieties of quantitative (or frequency) distributions
are to be recognized: (1) those with equal intervals; and (2) those
with unequal intervals. Quantitative distributions of equal
classes constitute one of the most important devices of statistical
analysis. They are the standard form of frequency distribution.
They are clearly statistical series, and consequently not to be
dealt with in the present chapter. Quantitative distributions
with unequal classes, on the other hand, are closely related to
qualitative classifications and may be briefly considered here.
Suppose, for example, an age classification is given as follows:

Under 5
5 and under 16
16 and under 21
21 and under 45
45 and under 60
60 and over

The purpose of a classification such as this may be really to sub- 
divide the population into groups which have distinct economic 
and social significance. The qualities of the individuals in the 
different groups have been kept in mind in setting up the classi- 
fication. The fact that the groups are defined in quantitative 
terms is not of major importance. The decisive consideration has
been to set up the groups so as to make the individuals within
each group reasonably homogeneous. The distribution from 
this point of view is essentially qualitative. 
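
The inequality of the classes is easily exhibited. The Python sketch below places an age in its class and lists the widths of the closed classes; the function and variable names are assumptions, while the limits are those of the classification given above.

```python
from bisect import bisect_right

# Lower limits of the classes after the first, taken from the classification.
LIMITS = [5, 16, 21, 45, 60]
LABELS = ["Under 5", "5 and under 16", "16 and under 21",
          "21 and under 45", "45 and under 60", "60 and over"]

def age_class(age):
    """Return the class of the unequal-interval classification containing `age`."""
    return LABELS[bisect_right(LIMITS, age)]

# Widths of the closed classes, in years; the open-ended last class has none.
widths = [upper - lower for lower, upper in zip([0] + LIMITS, LIMITS)]

print(age_class(15), "|", age_class(16), "|", age_class(60))
print(widths)
```

The widths run from 5 to 24 years, so the class frequencies cannot be compared as they could in a distribution of equal intervals. This is in keeping with the view that the groups were chosen for their social and economic significance rather than for their size.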

Analysis of qualitative distributions may follow, more or less 
closely, the lines already indicated for classifications by kind. 
A reduction of the numbers in the different classes to percentages 
of the total number of cases in the aggregate is frequently, 
though not universally, desirable, since interest attaches com- 
monly not to differences in the size of the aggregates but rather 
to the manner in which these aggregates are distributed. Re- 
arrangement of the classes in the order of the number of individ- 
uals in each is not ordinarily desirable since this obscures the 
logical relationships involved in the relative degrees of the 
variable attribute. Graphic representation of qualitative dis- 
tributions is about the only opportunity for considerable statis- 
tical elaboration. These graphic methods call for further dis- 
cussion. 

Graphic representation of qualitative distributions is best 
accomplished through the use of bar diagrams. In discussing 
the use of bar diagrams for the representation of classifications 
by kind, it was pointed out that there are distinct reasons for pre-
ferring the horizontal to the vertical form. For the representa-
tion of qualitative distributions, on the other hand, the vertical
bar diagram is to be favorably regarded despite the convenience
of the horizontal form. This follows from the fact that in this
type of distribution approach is made to the quantitative form,
with which shortly analysis will have to deal.¹ In the graphic
representation of quantitative distributions in standard form,
it is conventional practice to scale off the variable attribute along
the horizontal axis and the variable frequencies along the vertical.
Qualitative distributions, since they are a closely related type of
distribution, may well follow this lead to the extent of using
ordinarily the vertical rather than the horizontal form of bar
diagram.² The form, then, to be employed is shown in Chart 5.

Graphic comparison of two or more qualitative or quantitative 
distributions, the frequencies of which are shown in absolute 
numbers or percentages, ordinarily encounters serious difficulties. 
In the first place, effective comparison of different distributions, 
whether of kind or degree, is hardly possible at all unless classi- 
fication has been upon substantially the same basis in the differ- 
ent distributions. This is not as likely to be the case with 
classifications by degree as it is with those by kind. Sometimes, 
however, the distinctions of degree rest upon generally recog- 
nized classifications and the distributions become strictly com- 
parable. Complications on the graphic side still remain. The 
possible forms for graphic comparison have already been indicated 
in the preceding section. As there indicated, no one of the
forms is entirely satisfactory though some may be made to serve. 

CHART 5. PERCENTAGE DISTRIBUTION OF GRADES IN LARGE INTRODUCTORY
COURSES AT HARVARD UNIVERSITY

[Vertical bar diagram; horizontal axis: Grade; vertical axis: Per Cent.]

1 It is to be recalled that both qualitative and quantitative distributions fall under the
more general head of classifications by degree.

2 The general rules governing the construction of bar diagrams are, of course, to be
recognized in this type of application as in others. Spaces may well be left between the
bars to distinguish this type of diagram from others that will be considered later.

The analysis of qualitative distributions can never be made
thoroughly effective. Little can be done except to compare the
distributions in some simple fashion as indicated above. Some-
times the distributions can be employed in the derivation of
averages, of rates, and of significant coefficients of one kind or
another. In general, however, statistical endeavor is directed
toward the reduction of distributions first to quantitative form,
and second, to quantitative form with equal class intervals. In
this last form the data open up a whole vista of analysis, as will be
shown in later chapters.¹


1 This chapter has dealt exclusively with the treatment of distributions in which the 
data express the number of cases falling within specified groups. Some of the methods 
described, however, are equally applicable to data not in the form of frequencies. Take, 
for example, data showing death rates by occupational groupings. The treatment of such 
data might well involve (1) arrangement of the items in the order of magnitude, and (2) 
graphic representation designed to present comparisons of size. The principles governing 
such treatment of the data are the same whether or not the items are in frequency form. 


CHAPTER VI 
FREQUENCY DISTRIBUTIONS 


FREQUENCY distributions state the frequency of occurrence of
different values of a single variable attribute. They show how the
total group of values is subdivided among indicated size-groupings
when differences of time and place are disregarded. Distributions 
of the frequency type, particularly when set up with equal class 
intervals, possess exceptional significance for statistical analysis. 
Their formulation and presentation are dealt with in this chapter;
their analysis will be considered in a series of later chapters.
Frequency distributions are set up in two distinct forms:
(1) the simple; and (2) the cumulative. Ordinarily the cumula-
tive is derived from the simple. Certainly the simple is the more
fundamental form of the two. It may well be considered first. 


A. SIMPLE FORM


A simple frequency distribution is an ordinary classification
of values by magnitude. The type is a very common one.
A single concrete case will serve to illustrate.


TABLE 15. FREQUENCY DISTRIBUTION OF SPEED AMONG VESSELS OF ANCONA
LINE, MARCH 31, 1919


SPEED NUMBER 
(in knots per hour) OF VESSELS 
It is to be remembered that the frequency distribution is a species of a more general
type, namely, the attributive distribution.

Since differences of magnitude or of number are of the very
nature of the variable, the simple frequency distribution is a 
form of fundamental significance. The different values of the 
variable, according to which the items are classified, may appear 
either in repeated observations of the same object, or in numer- 
ous observations of related objects. If the physicist attempts 
precise measurement of the length of a metal bar in the physical 
laboratory, he finds that successive measurements differ from 
one another: if a university health department measures the 
height of incoming students, it finds that height among these 
students varies considerably: if the labor office of a large manu- 
facturing establishment records the length of service of employees 
on the company’s payroll, it finds that the periods of service 
vary widely among the employees. Whether the variable 
element is in a single object, or in numerous objects, the general 
type of variation is the same. It is a type of fundamental 
significance, and is the subject matter of some of the most im- 
portant phases of statistical analysis. 

Individual observations, made at any given time and place, 
require systematic treatment before their meaning becomes 
apparent. A run of separate measurements conveys relatively 
little information regarding the character of the variable. Take, 
for example, the following returns showing the percentage of 
members unemployed in 120 trade unions on July 1, 1920:


TABLE 16. ORIGINAL REPORTS BY 120 UNIONS GIVING PERCENTAGE OF
MEMBERS UNEMPLOYED, JULY 1, 1920


2.5 1.8 3.9 3.0 3.6 3.4 2.9 2.8 5.6 2.0 2.6 ae? 
Ot 4.4 Bil 230) 1.4 Oe 1.6 5.9 el 2.9 1.8 3.6 
3.4 2.9 2.8 2.5 2 1.9 2.2 1.5 1.8 6.3 1.8 1.0 
6.0 27 pis 1.8 Bae Dpf 3.3 2a Pa ino 4.2 Dest 
2D Wee 1.4 4.2 2.8 5.1 ig) ° 3) Ta P27) 3.9 6.2 
4.6 8.4 3.8 Tee) 1.2 2.2 2.0 2.6 Bay 0.9 io) I 
te 4.5 1.0 2.3) 0.9 hoi I.4 1.8 a) 0.8 1.8 ye 
one RE Tie 0.5 1.7 0.2 Tez 5.8 2.2, 1.4 2.3 17 
9.8 3.4 2.6 2.0 4.8 1.6 59) 2.3 4.0 1g) 2.4 1.5


It is clear enough that this undifferentiated list of items is almost 
meaningless. Until the data are subjected to analysis, the actual 
degree of unemployment among these 120 unions can scarcely
be discerned. 

The same returns, presented in the form of a frequency dis- 
tribution, acquire new meaning. Consider the following table: 


TABLE 17. FREQUENCY DISTRIBUTION OF PERCENTAGE OF MEMBERS
UNEMPLOYED JULY 1, 1920, AMONG 120 REPORTING TRADE UNIONS


PERCENTAGE UNEMPLOYED    NUMBER OF UNIONS
0.0-9.9 (all classes)          120
0.0-0.9                          5
1.0-1.9                         36
2.0-2.9                         37
3.0-3.9                         17
4.0-4.9                         10
5.0-5.9                          7
6.0-6.9                          5
7.0-7.9                          1
8.0-8.9                          1
9.0-9.9                          1


Here the typical percentage of unemployment, the variability 
of the percentage unemployed, as well as the general character of 
the distribution, all begin to appear. 
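
The tabulation behind a table such as Table 17 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `tally` and the short sample list are hypothetical (the sample is not the 120 union returns), and values are converted to whole tenths so that class membership does not depend on floating-point rounding.

```python
from collections import Counter

def tally(values, width_tenths=10):
    """Count one-decimal percentages in equal classes (0.0-0.9, 1.0-1.9, ...)."""
    counts = Counter()
    for v in values:
        tenths = round(v * 10)                       # 2.5 -> 25
        lower = (tenths // width_tenths) * width_tenths
        counts[lower] += 1
    table = {}
    for lower in sorted(counts):                     # classes in order of size
        upper = lower + width_tenths - 1
        table[f"{lower / 10:.1f}-{upper / 10:.1f}"] = counts[lower]
    return table

# Hypothetical sample for illustration only.
sample = [2.5, 1.8, 3.9, 3.0, 0.9, 1.0, 2.9, 9.8]
print(tally(sample))
```

Classes with no items are simply omitted here; a fuller version would list them with a frequency of zero.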


1 The formation of the frequency table commonly constitutes the first step in analysis
of the data. At times, however, it pays to start the analysis of the variable by first forming 
an array. An array is merely an arrangement of the original items in the order of magni- 
tude. Thus, we have below an array of the items given above. 


TABLE 17A. ARRAY OF ORIGINAL REPORTS OF 120 UNIONS GIVING PERCENTAGE OF
MEMBERS UNEMPLOYED


0.2  1.3  1.6  1.8  2.1  2.4  2.8  3.3  4.1  5.6
0.5  1g}  1.6  1.8  2.1  2.4  2.0  3.4  4.1  5.7
0.8  1.4  1.6  1.9  DP   Do   2.9  3.4  4.2  5.8
0.9  1.4  1.7  1.9  2.2  2.5  2.9  3.4  4.2  5.9
0.9  1.4  1.7  1.9  2.2  2.5  2.9  3.6  4.4  6.0
1.0  1.4  1.7  2.0  2.2  2.6  2.9  3.6  4.5  6.2
1.1  1.5  1.7  2.0  BD   2.6  3.0  3.7  4.5  6.3
1.2  1.5  1.8  2.0  2.3  2.6  3.1  3.7  4.6  6.8
1.2  1.5  1.8  Bei  2.3  2.7  3.2  3.8  4.8  6.9
Tr)  1.5  1.8  De   2.3  Day  3.2  3.9  ae   ices
Te   Tie  1.8  2.1  2.3  2.8  Be   3.9  5.3  8.4
1.3  1.6  1.8  2.1  Dee  2.8  3.3  4.0  5.3  9.8


As will readily appear, the array is not an effective statement of the character of the variable. 
The array is useful primarily as an aid in setting up the best possible frequency distribution, 
in determining the width of the classes, and their limits. For this purpose it may well be
employed when the number of items is small. When the number of items is large, the
frequency distribution is ordinarily obtained at once on the basis of some classification which
appears to promise satisfactory results.

The translation of data into a frequency distribution raises a
number of questions. First, how wide should be the classes into 
which the individual returns are thrown? Thus, in the illustra- 
tion given, should the classes cover a range of one-half of one per 
cent, or a full per cent, or some other amount? This is the 
question of the class interval. Second, where should the limits of
the classes be located? Should the first class above, for example,
run from 0.0 to 0.9 or from 0.5 to 1.4? Third, how should the
classes be designated? To suggest only two possibilities, the first
class in Table 17 might read “0.0 to 1.0”; or, instead, “0.0 and
under 1.0.” The best form of designation has to be ascertained.
Fourth, how many items actually fall into the designated classes?
The variable must be distributed before a series is forthcoming;
the frequencies must be found. These several questions involve
points which require careful analysis. 

The first point to be emphasized in connection with the question 
of the width of the classes is the importance of uniformity. If 
the classes are indeterminate in width, or if they are manifestly 
unequal, the distribution is, from many points of view, seriously 
defective. The great disadvantage of unequal classes lies in the 
resultant ambiguity of the number of cases in the several classes. 
Take the following example:


TABLE 18. NUMBER OF GARMENT MANUFACTURING CONCERNS IN 1917
GROUPED ACCORDING TO INVESTED CAPITAL


INVESTED CAPITAL            NUMBER OF CONCERNS
Under $10,000                       7
$10,000-$20,000                    29
$20,000-$50,000                   108
$50,000-$100,000                   57
$100,000-$200,000                  37
$200,000-$500,000                  25
$500,000-$1,000,000                 9
$1,000,000-$2,000,000               7
$2,000,000-$5,000,000               2


Source: Treasury Report on Corporate Earnings and Government Revenues for the year
1917, p. 45.
The numbers in the several groups of this table reflect in no small
measure the differences in the widths of the classes to which they 
correspond. Thus, the number 29 in the second group relates 
to a class which is only one-third as large as that for which the 
number 108 is given; the number 108 is attached to a class which 
is only three-fifths as wide as the class in which only 57 cases are 
found. Thus, the significance of the number of cases appearing 
in the different classes is not altogether clear. The general 
relationship between the number of concerns and the amount of 
invested capital in each is not brought out. From many points 
of view, this obscurity in the distribution constitutes a serious 
defect. Distributions with equal classes, on the other hand, 
express clearly the relationship between the variable attribute 
and the variable frequency of occurrence. The importance of 
uniformity in the width of the classes can hardly be exaggerated.1
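
One way to see how far the counts of Table 18 reflect class width is to restate each frequency per equal unit of width. The sketch below is illustrative rather than a method taken from the text: the helper name `per_unit_width` is hypothetical, the unit of $10,000 is an arbitrary choice, and the lower limit of the open class "Under $10,000" is assumed to be zero.

```python
# Class limits (dollars) and counts from Table 18. The lower limit of the
# first class ("Under $10,000") is ASSUMED to be 0 for illustration.
classes = [
    (0, 10_000, 7),
    (10_000, 20_000, 29),
    (20_000, 50_000, 108),
    (50_000, 100_000, 57),
    (100_000, 200_000, 37),
    (200_000, 500_000, 25),
    (500_000, 1_000_000, 9),
    (1_000_000, 2_000_000, 7),
    (2_000_000, 5_000_000, 2),
]

def per_unit_width(classes, unit=10_000):
    """Return the number of concerns per `unit` dollars of class width."""
    return [count * unit / (upper - lower) for lower, upper, count in classes]

densities = per_unit_width(classes)
for (lower, upper, count), d in zip(classes, densities):
    print(f"{lower:>9,}-{upper:>9,}  count={count:>3}  per $10,000={d:.3f}")
```

Stated this way, the 108 concerns of the third class amount to 36 per $10,000 of capital, against 29 in the second class and 11.4 in the fourth, a relationship which the raw counts alone do not show.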

Though uniformity in the groupings is of great importance in 
quantitative classifications, distributions are not infrequently set 
up with unequal classes. Sometimes, as noted in the previous 
chapter, the disparity of the classes results from an attempt to 
get homogeneous groupings; the classification is essentially quali-
tative, though quantitatively stated. In other cases, the
explanation of inequality in the classes is to be found in mere 
convenience of tabulation. Or, it may be that certain unequal 
classes have come to be recognized and the individual cases are 
thrown into these as a matter of course at the time of the original 
reporting. Table 19 serves as an illustration. 


1 This rule of equal classes does not preclude, however, the use of more detailed sub- 
divisions of the variable for that part of the range in which the variation is most significant. 
If supplemental information is desirable for this part of the range, it may be given by dividing 
the intervals into smaller equal parts. With the classification given in this form, it is a simple 
matter to obtain a frequency distribution of equal classes by merely adding the frequencies 
of the smaller classes given for the more detailed portion of the distribution. 

The common practice of using indeterminate classes (open ends) at either extremity of 
the frequency distribution is obviously a violation of the rule of equal class intervals. Un- 
doubtedly such indeterminate classes are convenient as far as tabulation is concerned, but 
from the point of view of statistical analysis a frequency distribution with open ends is an 
obstruction. If for any reason it seems necessary in the interest of compact tabulation 
to employ indeterminate extreme classes in a frequency distribution, the exact size of the
items falling within such classes, or, in any case, the average size of such items, should be
stated in connection with the tabulation.

TABLE 19. FREQUENCY DISTRIBUTION OF LENGTH OF SERVICE AMONG
EMPLOYEES WHO LEFT DURING SIX MONTHS’ PERIOD


LENGTH-OF-SERVICE PERIOD              NUMBER OF EMPLOYEES
One week or less                             106
Over one week to two weeks                    92
Over two weeks to one month                  225
Over one month to three months               450
Over three months to six months              257
Over six months to one year                  194
Over one year to two years                    79
Over two years to three years                 15
Over three years to five years                10
Over five years                                9


Doubtless at still other times inequality of the classes reflects 
little more than the failure of those in charge of the tabulation to 
recognize the strong presumption in favor of equal classes in all 
quantitative distributions. 

The requirement of uniform class intervals in frequency dis- 
tributions is sometimes to be interpreted logarithmically instead 
of arithmetically ; that is, relatively instead of absolutely. Un- 
doubtedly the case is a special one, but it is to be recognized none 
the less. The method involved can be best explained in a con- 
crete illustration.1 

Suppose that an analysis is being made of price relatives 
expressing the ratios of individual commodity prices at a given 
date — say May 1, 1922 —to the prices of an earlier date — 
say May 1, 1914.2 The actual ratios are found to range from
63 to 395. The distribution of the ratios in simple frequency form 
is shown in Table 20. The variable here is subject obviously to 
a special condition: it may rise by any amount above 100, say 
by 300 or 400; it cannot fall by more than 100 below 100 — it 
cannot become less than 0. The distribution, arithmetically
considered, is almost certain in any period of rising prices to 
skew toward the upper end of the range. 


1 The illustration is from the Review of Economic Statistics, prel. vol. IV, 1922, pp. 195-106. 
2 Bradstreet’s 96 commodity quotations may serve for this purpose. The quotations are 
to be found in Bradstreet’s, May 15, 1915, p. 311, and May 13, 1922, p. 300. 
TABLE 20. FREQUENCY DISTRIBUTION OF PRICE RELATIVES
OF 96 COMMODITIES REPORTED BY BRADSTREET’S
FOR MAY 1, 1914, AND MAY 1, 1922


(May 1, 1914 = 100)


PRICE RELATIVES    NUMBER OF COMMODITIES
 61- 80                     5
 81-100                     9
101-120                    12
121-140                    15
141-160                    19
161-180                    13
181-200                    11
201-220                     4
221-240                     2
241-260                     1
261-280                     2
281-300                     0
301-320                     0
321-340                     1
341-360                     0
361-380                     1
381-400                     1


Furthermore, it may reasonably be argued that from some 
points of view an increase in a relative from 100 to 200 is no more 
than equivalent — though opposite — to a decrease from 100 
to 50. In other words, the price changes are to be viewed as of 
equal significance if equal proportionately, not absolutely. This 
suggests that the requirement of equal intervals in the frequency 
distribution be so interpreted as to preserve a constant ratio 
between successive classes rather than to preserve a constant 
difference. 

This principle is introduced in the frequency distribution shown 
in Table 21. Here class intervals are laid off so that each class 
limit stands just 15 per cent above the next lower; thus, 72
is 15% above 62, 83 is 15% above 72, and so on.1 With
the intervals so arranged, the distribution ceases to be mod- 
erately asymmetrical and becomes essentially symmetrical. 


1 A scale of equal ratios is easily obtained through the use of a log table or
by marking off equal distances on paper which has been logarithmically ruled.
The classes are most easily located in terms of their lower (or upper) limits.

From some points of view, it presents a truer picture of the
character of the variation than does the distribution given in 
Table 20. 


TABLE 21. FREQUENCY DISTRIBUTION OF PRICE RELATIVES
OF TABLE 20 ON RATIO (OR LOGARITHMIC) SCALE


PRICE RELATIVES    NUMBER OF COMMODITIES
 62- 71                     1
 72- 82                     4
 83- 94                     7
 95-108                     5
109-124                    11
125-143                    15
144-165                    20
166-190                    16
191-218                     9
219-251                     2
252-289                     3
290-333                     1
334-384                     1
385-440                     1
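
The ratio scale behind Table 21 can be sketched as follows. This is a minimal illustration, not the author's working: the function name `ratio_boundaries` is hypothetical, the starting value 62 and the ratio 1.15 are taken from the text, and the upper value 395 is the highest price relative reported. The printed limits of Table 21 are whole numbers, so they agree with these computed boundaries only approximately.

```python
import math

def ratio_boundaries(start, ratio, stop):
    """Return boundaries start, start*ratio, ... until `stop` is covered."""
    bounds = [float(start)]
    while bounds[-1] < stop:
        bounds.append(bounds[-1] * ratio)
    return bounds

bounds = ratio_boundaries(62, 1.15, 395)

# On a logarithmic scale the classes are of equal width.
log_steps = [math.log10(b) - math.log10(a) for a, b in zip(bounds, bounds[1:])]

print([round(b, 1) for b in bounds])
print(len(bounds) - 1, "classes, each", round(log_steps[0], 4), "wide in log10 units")
```

The constant step in the logarithms is what is meant by interpreting the requirement of equal intervals relatively instead of absolutely.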


Turning now to the choice of the class interval, we find that a 
number of considerations apply. These may be briefly stated as 
follows:


1. The interval should be of such size as to bring out the
characteristic structure of the distribution; otherwise, the
classification fails of its primary purpose.

2. The interval should not be so small as to lose the advantage
of convenient summarization. In other words, the fre-
quency distribution should compress the record of the
variable into much smaller compass.

3. At the same time, condensation must not go too far. In
particular the interval should not be so large that serious
error is involved in assuming that all the items of any class
are equal to the mid-value of the class (the value of the
variable halfway between the upper and lower limits of the
class).1


1 This consideration is fundamental since this very assumption is made repeatedly in 
subsequent analysis of the variable. The matter will be covered in a later chapter. 
4. The interval should not be so small as to result in numerous
vacant classes in the significant parts of the range of the 
variable. 

5. If the data show a tendency to concentrate on round
numbers, each class should contain an equal number — 
preferably only one — of these numbers. 

6. If the variable is discrete, commonly the single unit is the 
natural interval unless the range is great. Thus, if we are 
seriating returns giving the sizes of families in a certain 
population, the classes will naturally run: 2, 3, 4, ... up
to the maximum number in any reported family. 

7. Where a number of similar frequency distributions are to 
be compared, an effort should be made to obtain an interval 
which will serve satisfactorily for all the distributions. 

8. In general, intervals are to be preferred which are based 
upon common and convenient divisions of the scale: thus, 
classes of five units are more convenient than classes of 
three. 


These several rules should be carefully observed in the selec- 
tion of the class interval. They may be supplemented, further- 
more, by a rule of thumb which is commonly cited; namely, that 
an interval which casts the variable into 15-25 classes will ordi-
narily be found most satisfactory.1 In applying this rule the
range of the variable is divided by some number between 15 and
25 and an approximate value for the class interval thus obtained.2
A more convenient interval of about the same magnitude is then 
selected. 
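
The rule of thumb may be sketched in Python as follows. The function name `convenient_intervals` is hypothetical, and the notion of a "convenient" interval is here assumed to mean 1, 2, or 5 times a power of ten; the text itself specifies only that the interval be a common and convenient division of the scale.

```python
import math

def convenient_intervals(low, high, min_classes=15, max_classes=25):
    """Return widths of the form 1, 2, or 5 times a power of ten that
    divide the range high - low into min_classes..max_classes classes."""
    span = high - low
    narrowest = span / max_classes
    widest = span / min_classes
    exponent = math.floor(math.log10(narrowest))
    found = []
    for k in range(exponent - 1, exponent + 3):   # a few decades around the target
        for m in (1, 2, 5):
            width = m * 10.0 ** k
            if narrowest <= width <= widest:
                found.append(width)
    return found

print(convenient_intervals(63, 395))    # price relatives of Table 20
print(convenient_intervals(0.2, 9.8))   # unemployment percentages
```

For the price relatives (range 63 to 395) the sketch returns an interval of 20, the width used in Table 20. For the unemployment percentages it returns 0.5, whereas Table 17 uses a full per cent; the rule is a guide to be weighed with the other considerations, not a substitute for them.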

The width of the class having been fixed, the next step is to 
locate the several classes on the scale of the variable. Thus, 
if the class interval covers five points —as it might in the 
percentages of unemployment shown above — the classes may 
be placed in five different positions: 1.0-1.4, etc.; 1.1-1.5, etc.;

1 Yule, Introduction to the Theory of Statistics, p. 79. Rugg, in Statistical Methods
Applied to Education (pp. 83-86), suggests 10-20 classes instead of 15-25. The rule is
sometimes given as “not less than 12 classes.”

2 By the range of the variable is meant the difference between the highest and lowest
values of the variable. By the effective range is meant this difference after stray items, or
scattered items widely removed from the others, have been eliminated. If there is a marked
difference between the expressed range and the effective range, the latter is to be used in
following the above rule.

1.2-1.6, etc.; 1.3-1.7, etc.; 1.4-1.8, etc. Upon what principle
is a choice to be made from among these possibilities ? 

The considerations governing the location of the class are as 
follows: 


1. When the location of the class appears to affect considerably
the form of the distribution, that position of the classes
should be chosen which brings out most clearly the apparent
structure of the distribution. In meeting this requirement,
study of the array will sometimes be of assistance.

2. Classes should be so located as to justify as fully as possible
the assumption that their average size corresponds to the
mid-value of the class. It follows that when there is a
clustering of items about certain values of the variable,
these values should be placed, if possible, at the middle of
class intervals. (This rule should not be thought to give
warrant, however, for unequal class intervals.)

3. It is sometimes convenient to make the mid-points of the
classes integers.

4. Intervals starting and ending with integers (or with simple
fractions or multiples), and using the basic tens-system,
are conducive to ease and accuracy both in tabulation and
analysis. (Because of the employment of short-cut
methods of calculation, explained later, this rule com-
monly supersedes rule 3.)

5. The classes should be so placed as to preserve equal intervals
at the extremes of the range. Thus, in the case of the un-
employment data cited in Table 17, it is better to have the
limits of the first class 0.0 to 0.9 than 0.1 to 1.0, for otherwise
the item 0.0 falls outside and cannot be provided for in a
series of equal intervals. Slight inequality in terminal
classes, however, is sometimes unavoidable, e.g. in a dis-
tribution running from 0.0 per cent to 100 per cent. In
such cases the limits should be put where they will cause the
least disturbance.
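
The second of these considerations can be tested numerically for each possible position of the classes. The sketch below is an illustration only: the function name `midvalue_error` is hypothetical, the data are invented values (in tenths of a per cent) clustering about 1.0, 1.5, 2.0, and 2.5, and the class interval is assumed to cover five points.

```python
from collections import defaultdict

def midvalue_error(tenths, width, offset):
    """Weighted mean gap between each class average and its mid-value.

    Items are whole tenths; each class covers `width` consecutive values,
    and one class starts at `offset`.
    """
    groups = defaultdict(list)
    for t in tenths:
        lower = offset + ((t - offset) // width) * width
        groups[lower].append(t)
    n = len(tenths)
    error = 0.0
    for lower, items in groups.items():
        mid = lower + (width - 1) / 2            # mid-value of the class
        mean = sum(items) / len(items)           # actual average of the class
        error += abs(mean - mid) * len(items) / n
    return error

# Hypothetical data clustering on 10, 15, 20, 25 (tenths of a per cent).
data = [10, 10, 10, 11, 14, 15, 15, 15, 20, 20, 20, 21, 24, 25, 25]

errors = {offset: midvalue_error(data, 5, offset) for offset in range(5)}
best = min(errors, key=errors.get)
print(errors)
print("best offset:", best)
```

With these invented figures the best position is the one that places the clustering values at the middle of the classes (0.8-1.2, 1.3-1.7, and so on), though such limits would conflict with the preference for classes beginning at round values; the two considerations have to be balanced.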


Mention has been made of the class limits. By a class limit 
is meant the precise point at which individual items are separated 
into the lower and the upper of two adjacent classes. Accurate 
determination of the limits is an important matter since it affects 
many of the averages and coefficients which are commonly derived 
in the analysis of attributive variables. The exact class limits
should be noted in connection with the analysis of each dis-
tribution.1

In the case of a discrete variable, the class limits will ordinarily 
be self-evident. Thus, if retail stores are being classified accord- 
ing to the number of clerks in each, and the classes are 1-25, 
26-50, 51-75, etc., it is obvious enough that 1, 26, 51, etc., are 
the lower limits of the successive classes, and 25, 50, 75, etc., the 
upper limits. No complications arise here. 

If, on the other hand, the variable is continuous, the case is 
by no means as clear. Suppose that the individual heights of a 
group of athletes are measured to the nearest quarter-inch. 
These records are thrown subsequently into a frequency table 
with the following classes : 


5’ 6” - 5’ 6¾”
5’ 7” - 5’ 7¾”
5’ 8” - 5’ 8¾”, and so on


It might appear from these designations that the lower limits
of the classes are 5’ 6”, 5’ 7”, 5’ 8”, and so on. As a matter
of fact, a little reflection upon the accuracy of the original
measurements will make it clear that any man whose height was
5’ 6⅞” will have been thrown into the second class, and any
one whose height was 5’ 7⅞” into the third. The real limits
of the classes are not identical with the expressed limits. In 
general, actual class limits can be ascertained only when the 
precision of original measurements and the disposition of line 
cases are known.2 With such information at hand, the actual
class limits can be positively stated. As will later appear, dif- 
ferences between actual and expressed limits often need to be 
carefully noted in dealing with frequency distributions. 
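
The relation between expressed and actual limits can be worked out exactly with fractions. The sketch below assumes, for illustration, that heights are recorded to the nearest quarter-inch, that line cases are given the next highest value, and that the first class is expressed as 5 ft. 6 in. to 5 ft. 6 3/4 in. (66 to 66 3/4 inches); the function name `actual_limits` is hypothetical.

```python
from fractions import Fraction

def actual_limits(expressed_lower, expressed_upper, precision):
    """Actual class limits when readings are rounded to the nearest
    `precision` and line cases are given the next highest value.

    The class then holds every true value v with lower <= v < upper.
    """
    half = Fraction(precision) / 2
    return Fraction(expressed_lower) - half, Fraction(expressed_upper) + half

quarter = Fraction(1, 4)
low, high = actual_limits(66, Fraction(267, 4), quarter)   # 66 to 66 3/4 inches
print(low, high)
```

Under these assumptions the actual limits of the first class are 65 7/8 and 66 7/8 inches, so that a man of exactly 5 ft. 6 7/8 in. belongs to the second class.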


1 The value of the variable halfway between the upper and lower limits of a class is 
called the mid-value of the class. 

2 Line cases are ordinarily given the next highest value; thus, if measurement is to the
nearest quarter-inch and an individual appears to be exactly 5’ 6⅛”, a report of 5’ 6¼” is
made. Line cases may, however, be dealt with otherwise: say, by throwing them up and
down in equal proportion.

After the width, location, and limits of the classes have been
determined, it remains to give the classes appropriate designa- 
tions. A number of different styles are employed. Thus, in 
Table 17 (see page 64) the stub might be given in any one of the 
following ways:1


(a)            (b)                      (c)
0.0- 1.0       0.0 and under  1.0       0.0-0.9
1.0- 2.0       1.0  "    "    2.0       1.0-1.9
2.0- 3.0       2.0  "    "    3.0       2.0-2.9
3.0- 4.0       3.0  "    "    4.0       3.0-3.9
4.0- 5.0       4.0  "    "    5.0       4.0-4.9
5.0- 6.0       5.0  "    "    6.0       5.0-5.9
6.0- 7.0       6.0  "    "    7.0       6.0-6.9
7.0- 8.0       7.0  "    "    8.0       7.0-7.9
8.0- 9.0       8.0  "    "    9.0       8.0-8.9
9.0-10.0       9.0  "    "   10.0       9.0-9.9


Of these three forms, (a) is unsatisfactory because nominally
ambiguous; the classes overlap: 1.0, for example, appears in
both of the first two designations.2 It is true, of course, that
this style of designation is ordinarily read 0.0 and under 1.0,
1.0 and under 2.0, and so on. But if this is so, why not state
the matter explicitly, as form (b) does? The argument seems 
conclusive, yet form (a) remains in wide and reputable use. 

Form (b) is clear in meaning, but somewhat cumbersome.
Furthermore, it does not indicate the actual class limits. These 
are ascertainable under form (b) only when the accuracy of 
measurement in the observations is somewhere definitely stated. 
The class designation itself does not give this information. 

Form (c) meets these objections completely. It is simple and 
unmistakable in meaning. Furthermore, it can be used to 
indicate the precision of measurement in the original returns. 
The designations of the frequency distribution shown in Table 17 


1 Class designations may also be indicated in two other quite different ways: (d) by
specifying the mid-point of the successive classes, thus: .5, 1.5, 2.5, etc., or (e) by specifying
the variable to the nearest unit, thus: 0, 1, 2, 3, etc. Forms (d) and (e) are both satisfactory
under some circumstances, but neither is in common use, and neither is ordinarily to be
recommended.

2 If line cases have been thrown both up and down, the overlapping of the designations
is correct. But this is not usually the case.

indicate that the 120 unions in question reported unemployment
among their memberships to the nearest tenth of a per cent. 
Information regarding the precision of measurement in the 
original observations should be given whenever data are sub- 
mitted to analysis, and, if not carried in the class designation 
itself, should be supplied in some other clear form. In general, 
however, it seems best to use form (c) in giving class designations, 
making sure that the designations indicate exactly the precision 
or accuracy of the original measurements. 
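
The three styles of designation can be generated mechanically once the class width and the precision of measurement are known. The sketch below is illustrative: the function name `designations` is hypothetical, and exact decimal arithmetic is used so that the upper limit in form (c) is the stated limit less one unit of precision.

```python
from decimal import Decimal

def designations(start, width, n_classes, precision):
    """Return (form a, form b, form c) labels for successive equal classes."""
    rows = []
    for i in range(n_classes):
        lower = start + i * width
        upper = lower + width
        form_a = f"{lower}-{upper}"                  # overlapping limits
        form_b = f"{lower} and under {upper}"        # explicit but cumbersome
        form_c = f"{lower}-{upper - precision}"      # shows precision of returns
        rows.append((form_a, form_b, form_c))
    return rows

rows = designations(Decimal("0.0"), Decimal("1.0"), 10, Decimal("0.1"))
for row in rows:
    print(row)
```

With a precision of one-tenth of a per cent, form (c) yields 0.0-0.9, 1.0-1.9, and so on; had the returns been given to the nearest hundredth, the same classes would read 0.00-0.99, 1.00-1.99.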

With the intervals properly set up, formation of the frequency 
distribution is merely a matter of counting the number of cases 
falling within each individual class.1 If the number of items is
large (say, 1,000 or more), the counting should be done by tabu- 
lating machines. If, however, the number is relatively small, the 
aggregates in question may be readily obtained by checking the 
items into a frequency table.2 When complete, the distribution 
appears as in Table 17. (See page 64.)

Now that the methods employed in setting up the frequency 
distribution have been made clear, the purpose of the device 
should be evident. In brief, it is effective summarization. By 
means of the frequency distribution, the variable is more easily 
surveyed and its special characteristics brought to light. Errors 
of individual observation tend to disappear, and irregularities 
in the form of the variable are revealed. Furthermore, compari- 
son is greatly facilitated. In fact, it is only by means of sum- 
marization of one kind or another that comparison is made 
possible. The frequency distribution is one of the most im- 
portant summary forms employed in statistical analysis. 

At the same time it should be realized that the frequency dis- 
tribution suffers the limitations of all condensed or summary 
statements. There is unavoidably a loss of detail. These 

1 Line cases may sometimes be disposed of by refining the measurement of the variable.
If this is impossible, they may be apportioned to the adjacent classes either in equal
numbers or in the ratio of the frequencies before the line cases are assigned. See Yule,
Introduction to the Theory of Statistics, pp. 80-81.

2 See Appendix C for a brief discussion of the mechanics of tabulation.

Conversion of the frequencies to percentage form is usually desirable when comparisons
are to be made.

details are only to be found in the individual records. Care
should be exercised to prevent the picture given by the frequency 
distribution from misrepresenting in any way the character of the 
variable as disclosed in the individual observations. 

Graphic representation of simple frequency distributions as- 
sumes three forms: (1) the column diagram — sometimes called 
the histogram ; (2) the frequency polygon ; and (3) the frequency 
curve. These are illustrated in the separate diagrams of Chart 6. 

Certain features are common to all three forms of frequency 
diagram. Thus, the scale of the variable size of item (the in-
dependent variable) is horizontally placed; the scale of the
variable frequency of occurrence (the dependent variable),
vertically. The figures designating the scales of the two variables
are shown at the bottom along the horizontal, or x, axis, and on
the extreme left-hand edge of the diagram along the vertical,
or y, axis.1 In other respects, the three forms of diagram,
though resting at bottom upon the same principles of con- 
struction, follow somewhat different lines. 

In drawing a column-diagram, the axes and scales having been 
already placed, perpendiculars are first erected at the mid- 
points of the successive classes of the distribution. These 
perpendicular lines are then cut off at heights corresponding to 
the respective frequencies of the several classes. Finally, a 
series of rectangles are constructed, each rectangle having as its 
base a single class interval and as its altitude, one of the per-
pendiculars previously drawn.2 To bring the columns into sharp

1 The zero of the scale of frequencies should always appear at the foot of the vertical 
axis. It has been argued that the zero of the scale of the variable attribute should also be 
included in the figure and should appear at the left-hand end of the horizontal axis. (See 
Earle Clark, “The Horizontal Zero in Frequency Diagrams,” Quarterly Publication Amer-
ican Statistical Association, June, 1917, pp. 662-669.) The argument for this practice, how-
ever, is not clear. Sometimes it is not the relative spread but the geometric form of the
distribution which is of interest. But in comparisons of two or more distributions, there
is a fair presumption in favor of a horizontal scale which runs to zero. Such a scale may
well be adopted if it is determinable, and if the zero is not so remote from the actual sizes
occurring in the distributions as to require an excessively small scale for its inclusion in
the diagram.

2 Sometimes only enough of the rectangles is shown in the finished drawing to give the
appearance of an ascending and descending flight of stairs. (See National Bureau of Eco-
nomic Research, Income in the United States, vol. I, p. 128.) This is not, however, the
common practice.

CHART 6. FREQUENCY DISTRIBUTION OF PERCENTAGE OF MEMBERS
UNEMPLOYED JULY 1, 1920, AMONG 120 REPORTING TRADE UNIONS

[Chart not reproduced. Panels: A, Column Diagram; B, Frequency Polygon;
C, Frequency Curve. Vertical axis: Number of Unions (0 to 40). Horizontal
axis: Percentage of Members Unemployed (0 to 10.0).]

relief they may be colored or cross-hatched. The figure appear-
ing as diagram A in Chart 6 was drawn to these rules. 

In drafting the frequency polygon, the steps taken in construct- 
ing the column diagram are followed without change until the 
series of perpendiculars, erected to the height of the respective 
frequencies, have been drawn. Then, instead of constructing 
upon these perpendiculars a series of rectangles, the tops of the 
perpendiculars are connected with straight lines. Straight lines 
are also run from the tops of the perpendiculars at each end of the 
distribution to the base line at the mid-point of the class-inter- 
vals just beyond — on the assumption, of course, that the fre- 
quencies in these classes are zero. This standard form appears 
as diagram B. 
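
The two constructions just described reduce to simple coordinate lists, which may be sketched as follows. The function names are hypothetical; the frequencies are those of Table 17 (the count of 7 for the 5.0-5.9 class is inferred from the stated total of 120), and the class boundaries are assumed, for simplicity, to fall at 0, 1, 2, ... 10 per cent.

```python
# Frequencies from Table 17; the 7 for class 5.0-5.9 is inferred from the
# total of 120. Boundaries at 0, 1, ..., 10 are an ASSUMPTION for simplicity.
frequencies = [5, 36, 37, 17, 10, 7, 5, 1, 1, 1]
width = 1.0

def column_rectangles(freqs, width, start=0.0):
    """Return (left, right, height) for each column of the column diagram."""
    return [(start + i * width, start + (i + 1) * width, f)
            for i, f in enumerate(freqs)]

def polygon_vertices(freqs, width, start=0.0):
    """Return (x, y) points of the frequency polygon, closed on the base
    line at the mid-points of the empty classes just beyond each end."""
    mids = [start + (i + 0.5) * width for i in range(len(freqs))]
    points = [(mids[0] - width, 0)]
    points += list(zip(mids, freqs))
    points.append((mids[-1] + width, 0))
    return points

rectangles = column_rectangles(frequencies, width)
vertices = polygon_vertices(frequencies, width)
print(rectangles[:2])
print(vertices[:3], vertices[-1])
```

The polygon begins at -0.5 and ends at 10.5 on the horizontal scale. The first of these points has no meaning as a percentage; it arises solely from the convention of carrying the polygon to the mid-point of the class-interval just beyond each end of the distribution.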

Under certain conditions to be discussed presently, a smooth 
curve instead of a series of straight lines may be drawn to con- 
nect the tops of the perpendiculars representing the successive 
frequencies of the distribution. The resulting figure is called a
frequency curve. Full development of the subject of frequency 
curves carries one into analyses, commonly mathematical, lying 
quite beyond the scope of an elementary text. Satisfactory 
frequency curves may be obtained at times, however, by simple 
free-hand smoothing. When this elementary method is em- 
ployed, care should be exercised to keep the areas under different 
segments of the curve approximately equal to the areas under 
corresponding portions of the frequency polygon. The data 
represented in the column diagram and frequency polygon shown 
opposite are presented again, in diagram C, in the form of a 
frequency curve. 
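
Free-hand smoothing cannot itself be reduced to a formula, but the requirement that areas be preserved can be illustrated with a mechanical stand-in. The sketch below is not the method of the text: it substitutes a simple moving average with weights 1-2-1, chosen here only because it leaves the total frequency (and hence the total area, with equal class widths) unchanged. The function name smooth_121 is hypothetical.

```python
def smooth_121(frequencies):
    """Smooth a series of frequencies with weights 1-2-1 (divided by 4).

    The series is extended by one empty class at each end, so the result
    has two more entries than the input and the same total.
    """
    padded = [0, 0] + list(frequencies) + [0, 0]
    return [
        (padded[i - 1] + 2 * padded[i] + padded[i + 1]) / 4
        for i in range(1, len(padded) - 1)
    ]


# Simple distribution of Table 23.
frequencies = [5, 36, 37, 17, 10, 7, 5, 1, 1, 1]
smoothed = smooth_121(frequencies)
total_before = sum(frequencies)
total_after = sum(smoothed)
```

A free-hand curve would be judged by the same test: the totals before and after smoothing should agree, at least approximately.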

With three forms of diagrams available for representing fre- 
quency distributions, upon what basis is a choice to be made? 
Two or three criteria are in general to be considered: (1) appli- 
cability of the form of diagram in view of the character of the 
variable being represented ; (2) accuracy of the figure as a geo- 
metric representation of the particular distribution in hand; 
(3) visual effectiveness of the graphic form. With reference to 
the last of these criteria, it would appear — though opinions 




here may differ — that the frequency curve is most effective, 
and the column diagram least satisfactory, in giving a simple 
graphic impression of the structure of the distribution. But the 
use of the frequency curve is restricted by the fact that it is only 
applicable when the plot of the frequencies is suggestive of a 
smooth curve, and when the character of the variable warrants a 
certain assumption; namely, that if a larger and larger number 
of observations were included in the analysis, and the classifica- 
tion of the variable were made more and more detailed, a smooth 
curve would ultimately result. Commonly, no such assumption 
can be made. Choice has to be confined, therefore, in most 
cases, to the column diagram and the frequency polygon. Be- 
tween these two, the advantage of graphic effectiveness would 
seem to lie in general with the frequency polygon. Use of the 
column diagram nevertheless is definitely indicated in two cases : 
(1) when the number of items in the several classes is small ; and 


(2) when the variable is discrete, especially when the values it
may assume are relatively few in number, e.g. in size of family.
In general, unless the smooth curve is clearly appropriate, or
the column diagram specifically indicated, the frequency polygon
will be found most satisfactory.

TABLE 22. PERCENTAGE DISTRIBUTION OF WEEKLY WAGES AMONG MALE
WEAVERS IN COTTON MILLS OF STATES A AND B

                  PERCENTAGE OF TOTAL WORKERS IN STATE
WEEKLY WAGES            State A        State B

$ 2.00-21.99            100.00         100.00
  2.00- 2.99              0.7            2.5
  3.00- 3.99              1.2            2.9
  4.00- 4.99              1.4            3.0
  5.00- 5.99              1.7            4.1
  6.00- 6.99              1.9            4.2
  7.00- 7.99              2.8            4.6
  8.00- 8.99              3.0            5.3
  9.00- 9.99              3.6           12.6
 10.00-10.99              3.6           15.5
 11.00-11.99              4.0           14.9
 12.00-12.99              4.8           10.0
 13.00-13.99              7.8            8.0
 14.00-14.99             17.7            6.7
 15.00-15.99             13.7            4.7
 16.00-16.99             11.9            0.7
 17.00-17.99             10.1            ...
 18.00-18.99              5.6            ...
 19.00-19.99              2.9            ...
 20.00-20.99              1.2            ...
 21.00-21.99              0.3            ...

Although all three forms of diagram are widely employed in 
the graphic presentation of single distributions, the polygon and 
curve serve much better than the column diagram when two or 
more distributions are to be compared. Take, for illustration,
the two distributions given in the table opposite. These dis- 
tributions are effectively contrasted in Chart 7. 


CHART 7. PERCENTAGE DISTRIBUTION OF WEEKLY WAGES AMONG MALE
WEAVERS IN COTTON MILLS OF STATES A AND B

[Chart 7: the two distributions of Table 22 drawn on the same axes.
Vertical axis: Per Cent. Horizontal axis: Weekly Wages (in dollars).]

1 The distributions have been reduced to percentage form since it is assumed that it is
the form of the distributions upon which attention is to be focused in the chart. If the
absolute numbers of workers in the different wage groups were important, the frequency
distributions of actual numbers would be charted.

The use of the column diagram could hardly exhibit the differences
as clearly. Two polygons, or curves, superimposed
upon one another by being constructed on the same axes, afford
the best graphic comparison of simple frequency distributions.
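
Reduction to percentage form, which makes two distributions of unequal size comparable on the same axes, may be sketched as follows. The function name to_percentages and the numbers of workers are hypothetical and serve only to illustrate the arithmetic.

```python
def to_percentages(counts):
    """Express each class frequency as a percentage of the total."""
    total = sum(counts)
    if total == 0:
        raise ValueError("the distribution contains no items")
    return [100 * c / total for c in counts]


# Hypothetical numbers of workers in four wage classes in two states.
state_a = [20, 50, 100, 30]       # 200 workers in all
state_b = [300, 450, 200, 50]     # 1,000 workers in all
pct_a = to_percentages(state_a)
pct_b = to_percentages(state_b)
```

Although State B has five times as many workers, the two percentage series may be plotted against one vertical scale, and differences of form stand out at once.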

In the illustrative case just shown — weekly wages among 
cotton weavers in two different states — the two variables are 
reported in the same unit. Frequently this is not the situation : 
the scales of the two variables are in different units. This intro- 
duces complications. Sometimes when the units are different, 
they are convertible; that is, the relation of the one unit to the 
other is accurately known. Thus, output may be reported in 
one country in short tons, in another, in metric. Such dif- 
ferences do not seriously interfere with analysis, though they do 
sometimes involve conversions from one unit to the other. 
On the other hand, when the units of the two variables are of 
essentially different kind — e.g. inches in the one case, pounds 
in the other — caution must be exercised to make sure that the 
scales used in graphic comparison give a fair portrayal of the 
relationship of the variables.2
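
Where the units are convertible, the conversion is a matter of simple multiplication. The sketch below assumes the usual definitions (a short ton of 2,000 pounds, or 907.18474 kilograms, and a metric ton of 1,000 kilograms); the function name and the output figures are hypothetical.

```python
KG_PER_SHORT_TON = 907.18474    # 2,000 pounds avoirdupois
KG_PER_METRIC_TON = 1000.0


def short_to_metric_tons(short_tons):
    """Convert a quantity reported in short tons to metric tons."""
    return short_tons * KG_PER_SHORT_TON / KG_PER_METRIC_TON


# Hypothetical output figures reported in short tons.
reported = [1000, 2500, 400]
converted = [short_to_metric_tons(x) for x in reported]
```

No such conversion exists when the units are of different kind (inches and pounds, for example), and the choice of scales then calls for separate judgment.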


B. CUMULATIVE FORM 


The frequency distributions thus far considered have been of 
the so-called simple type ; the frequencies have referred to equal 
and mutually exclusive subdivisions of the range of the independ- 
ent variable. Another type of frequency distribution, though 
not as common as the simple, is increasingly used, and has im- 
portant applications in statistical analysis. This second type 
is known as the cumulative. 

The general character of the cumulative distribution will be
evident from examination of the following table in which a simple
distribution and the corresponding cumulative form are shown
in parallel columns.


1 Two independent column diagrams side by side would probably give the best results 
obtainable with this type of figure. 


2 The adjustment of scales here involves questions which can be dealt with only after the
treatment of certain later subjects.

TABLE 23. FREQUENCY DISTRIBUTIONS OF PERCENTAGE OF MEMBERS
UNEMPLOYED JULY 1, 1920, AMONG 120 TRADE UNIONS

                                        CUMULATIVE DISTRIBUTION
  SIMPLE DISTRIBUTION                (A)                          (B)
                           CUMULATIVE FROM SMALLEST     CUMULATIVE FROM LARGEST

Percentage   No. of Unions   Percentage   No. of Unions   Percentage   No. of Unions
Unemployed   Reporting       Unemployed   Reporting       Unemployed   Reporting

0.0-0.9           5          0.0-0.9           5          9.0-9.9           1
1.0-1.9          36          0.0-1.9          41          8.0-9.9           2
2.0-2.9          37          0.0-2.9          78          7.0-9.9           3
3.0-3.9          17          0.0-3.9          95          6.0-9.9           8
4.0-4.9          10          0.0-4.9         105          5.0-9.9          15
5.0-5.9           7          0.0-5.9         112          4.0-9.9          25
6.0-6.9           5          0.0-6.9         117          3.0-9.9          42
7.0-7.9           1          0.0-7.9         118          2.0-9.9          79
8.0-8.9           1          0.0-8.9         119          1.0-9.9         115
9.0-9.9           1          0.0-9.9         120          0.0-9.9         120


It is clear that the frequencies of the cumulative distribution 
relate to wider and wider segments of the scale of the independent 
variable, each segment being set off from one extreme or the other 
of the range. In the simple frequency distribution there must 
be no overlapping of the classes; on the other hand, overlapping 
is of the very nature of the cumulative distribution. Successive 
classes of the simple frequency distribution are mutually exclu- 
sive; those of the cumulative are successively inclusive of all 
that have gone before. The frequencies of the cumulative dis- 
tribution, in other words, cumulate from one end of the distribu- 
tion toward the other, the most inclusive class containing the 
total number of items covered by the distribution. 
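
The process of cumulation may be sketched as a running total taken from either end of the simple distribution. The frequencies below are those of the simple distribution in Table 23; the variable names are merely illustrative.

```python
from itertools import accumulate

# Simple distribution of Table 23: unions in classes 0.0-0.9, ..., 9.0-9.9.
simple = [5, 36, 37, 17, 10, 7, 5, 1, 1, 1]

# Cumulating from the smallest class upward.
from_smallest = list(accumulate(simple))

# Cumulating from the largest class downward.
from_largest = list(accumulate(reversed(simple)))
```

In each series the last, and most inclusive, class contains all 120 unions.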

Cumulative distributions assume two forms, depending upon 
which end of the scale is taken as the point from which to cumu- 
late. Ordinarily, cumulation is from the lower end of the range 
toward the upper; but the reverse arrangement may sometimes 
be used to advantage. All depends upon the purpose of the
distribution. The former of the two arrangements is commonly
referred to as the “less than” arrangement, the other as the
“greater than.” Formulation of the two types rests upon
essentially the same principles. The two forms are both shown in
Table 23.

Cumulative frequency distributions are usually derived 
directly from the simple. Addition of intervals and of frequencies 
is all that is involved. Obtained in this way, the cumulative 
distribution encounters practically no complications in the deter- 
mination of the width, location, limits, or designations of the 
classes. These are all obtained as obvious corollaries of the 
classes of the simple distribution. In other words, the cumula- 
tive distribution is a direct derivative from the simple. 

One slight complication may arise in casting a simple
distribution into cumulative form. This is in giving the
designations correct cumulative form. A number of forms are employed.
Thus, the designations of the cumulative distributions shown in
Table 23 might be given as follows:

(A) “0.9 and under, 1.9 and under,” etc., or “Under 1.0, under 2.0,” etc., or
“Less than 1.0, less than 2.0,” etc.

(B) “9.0 and over, 8.0 and over,” etc., or “Over 8.9, over 7.9,”1 etc., or “Greater
than 8.9, greater than 7.9,” etc.

While these styles of designation are all in common use, the form
employed in Table 23 (Sections A and B) is recommended as
indicating more definitely than the others the actual width of the
successive classes of the distribution.2

Graphic representation of cumulative distributions is ordinarily
by a smooth curve.3 The standard form is shown in Chart 8.
The data are given in Table 24.

1 It would be in error to give the designation here as “Over 9.0, over 8.0,” etc.

2 It should be noted that no form of designation indicates unmistakably the precision
with which the magnitudes of the variable have been recorded. This point, consequently,
should be covered by a direct statement.

3 Much might be said in favor of the column form for certain cumulative distributions,
but as a matter of fact, column and straight-line diagrams are not employed to represent
cumulative distributions.

The underlying principles observed in the drawing of a curve
of this type, commonly called an ogive, are identical in most
respects with those followed in setting up the corresponding
simple form. Thus, the figure is constructed on coördinate axes,
the scale of the variable attribute being laid off along the x-axis,
the scale of the variable frequencies along the y-axis.1 In only
one particular is an important difference to be noted: the
perpendiculars erected in the construction of the figure are to be
raised from the class limits and not from the mid-points.
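
The one point of difference, that cumulative frequencies are plotted over class limits rather than mid-points, may be sketched as follows. The function name ogive_points is hypothetical; equal class widths are assumed, the class 0.0-0.9 is treated as running from 0 to 1, and the frequencies are those of the simple distribution in Table 23, cumulated from the smallest.

```python
from itertools import accumulate


def ogive_points(lower_limits, width, frequencies):
    """Return the (x, y) points that fix a "less than" ogive.

    Each cumulative frequency is plotted over the upper limit of its
    class; the curve starts from zero at the lower limit of the first.
    """
    if len(lower_limits) != len(frequencies):
        raise ValueError("one frequency is needed for each class")
    cumulative = list(accumulate(frequencies))
    points = [(lower_limits[0], 0)]
    points += [(low + width, c) for low, c in zip(lower_limits, cumulative)]
    return points


lower_limits = list(range(10))                     # assumed limits 0, 1, ..., 9
frequencies = [5, 36, 37, 17, 10, 7, 5, 1, 1, 1]   # Table 23
points = ogive_points(lower_limits, 1.0, frequencies)
```

A smooth curve drawn through these points gives the ogive; the vertical scale must run to the total of 120.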


CUMULATIVE PERCENTAGE DISTRIBUTION OF WEEKLY WAGES AMONG MALE
WEAVERS IN COTTON MILLS OF STATE A

TABLE 24

WEEKLY WAGES        CUMULATIVE PERCENTAGE

$ 2.00- 2.99               0.7
  3.00- 3.99               1.9
  4.00- 4.99               3.3
  5.00- 5.99               5.0
  6.00- 6.99               6.9
  7.00- 7.99               9.7
  8.00- 8.99              12.7
  9.00- 9.99              16.3
 10.00-10.99              19.9
 11.00-11.99              23.9
 12.00-12.99              28.7
 13.00-13.99              36.5
 14.00-14.99              54.2
 15.00-15.99              67.9
 16.00-16.99              79.8
 17.00-17.99              89.9
 18.00-18.99              95.5
 19.00-19.99              98.4
 20.00-20.99              99.6
 21.00-21.99              99.9

CHART 8

[Chart 8: ogive of the data of Table 24. Vertical axis: Cumulative
Percentage. Horizontal axis: Weekly Wages (in dollars).]


The meaning of the cumulative curve is by no means as
self-evident as that of the simple frequency diagram. Study of
cumulative diagrams will bring out a number of significant
points. It is clear, for example, that an ascending type of
cumulative curve can never fall in any part of its course. A fall
would be equivalent to a negative frequency in the simple
frequency distribution, a possibility which ordinarily need not be
entertained. Again, a relatively steep stretch in the curve
corresponds to a portion of the simple distribution in which
frequencies are high; a relatively flat stretch, to a section of the
simple distribution in which frequencies are low. Again, the
location of the curve as it passes opposite the mid-point of the
vertical scale shows whether the frequencies of the simple
distribution tend to mass at the middle or toward one side. Thus,
if the curve passes opposite the mid-point of the vertical scale
over a point more than halfway across the horizontal scale, the
frequencies are massed toward the upper part of the scale of the
independent variable. The ogive shown in Chart 8, for example,
corresponds to a simple frequency distribution in which the
frequencies tend to mass in the upper part of the scale. Once
there is thorough familiarity with such points as these, the
cumulative frequency curve can be as readily interpreted as the
simple. It then becomes a more useful instrument of statistical
analysis.

1 Of course, the scale on the vertical axis has to be so drawn as to accommodate the
maximum cumulative frequency (i.e. the total number or percentage of cases in the
distribution) instead of merely the maximum simple frequency in a single class, as in the simple
form.
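
The reading of the curve opposite the mid-point of the vertical scale may be sketched numerically. The sketch assumes, for simplicity, that successive plotted points are joined by straight lines (a smooth curve would give a slightly different reading); the function name crossing_point is hypothetical, and the data are the “less than” cumulative frequencies of Table 23 plotted over assumed class limits 0, 1, ..., 10.

```python
def crossing_point(limits, cumulative, level):
    """Return the x at which a rising cumulative line reaches `level`.

    `limits` and `cumulative` are the plotted points of a "less than"
    ogive; straight-line interpolation is used between them.
    """
    pairs = list(zip(limits, cumulative))
    for (x0, y0), (x1, y1) in zip(pairs, pairs[1:]):
        if y1 > y0 and y0 <= level <= y1:
            return x0 + (level - y0) * (x1 - x0) / (y1 - y0)
    raise ValueError("level lies outside the range of the curve")


limits = list(range(11))
cumulative = [0, 5, 41, 78, 95, 105, 112, 117, 118, 119, 120]

half_total = cumulative[-1] / 2                 # mid-point of the vertical scale
x_at_half = crossing_point(limits, cumulative, half_total)
middle_of_scale = (limits[0] + limits[-1]) / 2  # halfway across the x-axis
massed_toward = "upper" if x_at_half > middle_of_scale else "lower"
```

For these data the curve reaches the half-way level well before the middle of the horizontal scale, so that the frequencies are massed toward the lower part of the range.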

One or two special features of the cumulative diagram are of 
special importance. In the first place, the figure tends to assume 
a fairly smooth form; irregularities in the data are largely elim- 
inated through the process of cumulation. In the second place, 
in representing the cumulative distribution unequal class inter- 
vals are not particularly bothersome. Of course, they must be 
carefully regarded in the construction of the cumulative curve, 
but irregularity of spacing in the plotted points determining the 
curve does not appreciably affect its location. At times these 
features of the cumulative graph give it substantial advantages 
over the corresponding simple form. 

Comparison of cumulative distributions follows the lines 
suggested in dealing with the simple. Ordinarily, it is desirable 
to reduce the distributions in the first place to percentage form, 
since differences in the total number of items are of no
consequence. Graphic comparison of two or more cumulative
distributions may then be effected by superimposing them on the
same axes. Chart 9 affords an illustration. 

1 The more nearly the ogive assumes the position of a diagonal, the more nearly the
corresponding simple distribution must approach a rectangular or flat form in which
frequencies are approximately equal throughout the distribution.

CHART 9. CUMULATIVE PERCENTAGE DISTRIBUTIONS OF LENGTH OF SERVICE
AMONG EMPLOYEE EXITS IN FIRMS A AND B

[Chart 9: two cumulative curves on the same axes. Vertical axis:
Cumulative Percentage. Horizontal axis: Years of Service.]

Graphic comparison of two cumulative frequency distributions 
is obviously simple if the two distributions rest upon identical 
classifications of the same variable. If this is not the case, 
comparison is more complicated. There would seem to be no 
material objection, however, to superimposing the cumulative 
curves with horizontal scales so adjusted as to occupy essentially 
the same section of the horizontal axis. The two scales may 
then be indicated one above the other. 

A variant of the cumulative graph has been given an important
special application in comparisons of a particular kind, namely,
comparisons of the concentration of wealth, or income. The
graph is commonly referred to as the Lorenz curve, after its
originator, Dr. M. O. Lorenz.1 As originally employed, the curve
showed cumulative percentages of the total wealth, horizontally,
and cumulative percentages of the total population, vertically,
the population cumulating from poorest to richest. Under this
arrangement, the diagonal running from the lower right-hand
corner to the upper left-hand corner of the grid represents a
perfectly even distribution of wealth, or income: meaning by this
a distribution in which one per cent of the population possesses
one per cent of the wealth; two per cent of the population, two per
cent of the wealth; and so on. Deviation of the cumulative curve
from the diagonal therefore measures the actual concentration of
wealth. Consider the following diagram (after Lorenz):

CHART 10. LORENZ CURVES CONTRASTING DISTRIBUTION OF WEALTH
IN YEARS A AND B

[Chart 10: two Lorenz curves on the same axes. Vertical axis:
Cumulative Percentage of Population. Horizontal axis: Cumulative
Percentage of Wealth.]

1 See M. O. Lorenz, “Methods of Measuring the Concentration of Wealth,” Quarterly
Publication of the American Statistical Association, June, 1905, pp. 209-219.

The proportion of the wealth possessed by any given proportion
of the population is easily read from the figure. From a diagram
of this sort comparisons of the concentration of wealth are readily
made. For this particular case, no diagram serves as well as a
Lorenz curve.
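
The points of a Lorenz curve may be computed as in the sketch below, which follows the arrangement described above: cumulative percentage of wealth first (horizontal), cumulative percentage of population second (vertical), the population cumulating from poorest to richest. The function name lorenz_points and the holdings are hypothetical.

```python
def lorenz_points(holdings):
    """Return (% of wealth, % of population) pairs, poorest first."""
    ordered = sorted(holdings)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("total holdings must be positive")
    n = len(ordered)
    points = [(0.0, 0.0)]
    running = 0
    for i, holding in enumerate(ordered, start=1):
        running += holding
        points.append((100 * running / total, 100 * i / n))
    return points


# Hypothetical holdings of four persons under two conditions.
even = lorenz_points([10, 10, 10, 10])
uneven = lorenz_points([1, 2, 3, 14])
```

With equal holdings every point has equal coördinates and so lies on the line of perfectly even distribution; with unequal holdings the poorest half of the population is found to hold only 15 per cent of the wealth, and the curve departs from that line accordingly.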


CHART 11. LORENZ CURVE OF WAGE EARNERS AND WAGE RECEIPTS IN
GIVEN WAGE-EARNING GROUP

[Chart 11: Lorenz curve. Vertical axis: Cumulative Percentage of
Workers. Horizontal axis: Cumulative Percentage of Wages.]


The Lorenz curve is applicable in the analysis of any situation
in which individuals are recipients or possessors of particular
objects or benefits, such as wealth or income, or earnings. Thus,
the distribution of earnings among wage earners may be clearly
shown by a Lorenz curve. Professor G. R. Davies presents the
following case:1

DATA FOR LORENZ CURVE OF WAGE EARNERS AND WAGE
RECEIPTS IN GIVEN WAGE-EARNING GROUP

UPPER LIMIT OF              CUMULATIVE PERCENTAGE
WAGE CLASS           NUMBER OF WORKERS    WAGES RECEIVED

$1.00                      11.6                 5.7
 1.50                      44.0             [illegible]
 2.00                      93.9                89.0
 2.50                      90.7                93.1
 3.00                      98.3                96.1
 3.50                      99.2                97.8
 4.00                      99.4                98.5
 4.50                     100.0               100.0

1 See G. R. Davies, Introduction to Economic Statistics, pp. 39-49.
The same form of graphic analysis may be effectively used
wherever the problem of equality of distribution is involved.

CHART 12. LORENZ CURVES OF RENTS AND INCOMES IN SELECTED CITIES

[Chart 12: Lorenz curves. Vertical axis: Cumulative Per Cent of Rent
or Income. Horizontal axis: Cumulative Per Cent of Families or Persons.]

Source: W. C. Helmle, “The Relation between Rents and Incomes, and the
Distribution of Rental Values,” Bell System Technical Journal, vol. I, no. 2, pp. 82-109.

A third illustration of the use of Lorenz curves is given in 
Chart 12.1 Here the curves bring out clearly the facts that 
“incomes are more unequally distributed than rents,” and that 
“the proportion of income spent for rent is less among the 
larger incomes.” The Lorenz curve serves admirably to expose 
disparities of this sort. 

Frequency distributions, both simple and cumulative,
are among the most important instruments of statistical analysis.
They will be examined at length in Part III.2

1 Note that the position of the variables in this chart is the reverse of what it is in the
others. There appears to be no standard rule governing this feature of the Lorenz curve.

2 Two-way (correlative) distributions will be considered in Part IV.


CHAPTER VII 


SPATIAL DISTRIBUTIONS 


The distributions considered in the two preceding chapters
have been attributive. In these, classification of the variable 
is on the basis of differences of attribute; it has no reference to 
either time or place. Differences of time and space cannot, 
therefore, play any part in the subsequent analysis; attention 
has to be focused entirely on differences of attribute. 

In contrast to attributive distributions, another type rests 
fundamentally on observed differences in space. In this second 
type only such variation is analyzed as is directly connected with 
differences of location in the manifestations of the variable. 
Variation may then be regarded as a function of space. Dis- 
tributions which are set up on this basis may, therefore, be said 
to be spatial in character. 

The contrast between attributive and spatial distributions will 
become clearer if a concrete illustration is presented. Take, for 
example, the two distributions shown below: 


TABLE 26 A. DISTRIBUTION OF POPULATION OF
KANSAS, JANUARY 1, 1920

By Age

AGE GROUP            POPULATION

Under 5                187,262
5 to 9                 185,270
10 to 14               179,311
15 to 19               162,691
20 to 44               655,321
45 and over            396,665
Age unknown          [illegible]

Source: Fourteenth Census of the United States, vol. III, p. 344.

One of these distributions altogether ignores differences in the
location of the observed individuals; the other makes geographic
location of the individuals the very basis of classification.1

The nature of the spatial distribution is to be clearly under- 
stood. It is not to be confused with mere geographic classifica- 
tion. A presentation of the population of the United States by 
states, the states being alphabetically arranged, is a classification 
having geographic connotations; it is not a spatial distribution. 
From the point of view of spatial analysis, either an alphabetical 
or a magnitudinal arrangement is essentially disorderly. Only 
when the arrangement of classes is such as to bring out the 
character of the spatial relationships is the distribution to be 
considered truly spatial. 

Spatial relationships are of two general sorts: (1) linear rela- 
tionships, that is, relationships along a given line; (2) areal 
relationships, that is, relationships over a given area. An 
example of spatial relationships along a line appears in Table 27.


TABLE 27. NUMBER OF PERSONS LIVING WITHIN FIVE MINUTES’ WALK OF
SUCCESSIVE STOPS ON SUBURBAN TROLLEY LINE

DISTANCE FROM TERMINAL      NUMBER OF PERSONS LIVING WITHIN
(in miles)                  FIVE MINUTES’ WALK OF STOP

0.3                                  850
0.7                                  570
[illegible]                          310
[illegible]                          225
2.0                                  125
[illegible]                           75
4.0                                   50
5.0                              [illegible]
5.5                                  380
6.0                                  680


1 While the difference between the two distributions is fundamental in character, it is
not as far-reaching as that between either attributive or spatial distributions and the third
type, namely, temporal distributions. Attributive and spatial distributions alike bear no
reference to the passage of time; they are as of a given time. In other words, differences in
time in the observations of the variable are held to have no significance. It is possible, then,
to look upon spatial distributions as essentially like attributive distributions, the frequencies
in both forms of distribution being held to apply to specially designated groups under
conditions which are static as far as the distributions are concerned. But in a spatial
distribution the geographic connotations of the several classes give the distribution certain qualities
which may appropriately be made the subject of special analysis. It is the purpose of this
chapter to indicate those phases of method which are especially related to the formation
and graphic representation of distributions essentially spatial in character.

Spatial distributions of the areal type are more common and have
already been illustrated (see Table 26B). Rules governing the
seriation of the variable are essentially the same for both forms.
Since linear distributions present practically no complications,
discussion will be confined to distributions designed to show
spatial relationships over areas.

Formulation of spatial distributions involves essentially the 
same steps that have already been examined in Chapter VI in 
the study of frequency distributions. The problem in setting 
up a spatial distribution is to classify the individuals of some 
larger group so as to indicate the fashion in which they are 
spread out geographically. This clearly entails the choice of 
class intervals (or more exactly in this case, class areas), the 
location of class limits, and the selection of class designations. 
Let us consider these matters in turn. 

Determination of the class area in the case of spatial distribu- 
tions should recognize the same governing principles as were 
set forth in the discussion of the class interval in frequency 
distributions. The underlying object is to discover that area 
which will bring out the most significant characteristics of the 
distribution. Areas should not be so small as to labor the record 
with an excess of detail; neither should they be so large as to 
conceal important features of the distribution. Furthermore, 
it is of the utmost importance that the areas be equal; that dif- 
ferences in the number of cases in the various classes may relate 
solely to the variable density of the items, and not at all to dif- 
ferences in the sizes of the areas with which the items are asso- 
ciated. These general principles are to be carefully recognized 
in the formulation of spatial, as of all statistical, distributions. 

Unfortunately, the choice of class area in spatial distributions
is commonly determined independently of the dictates of
satisfactory statistical analysis. Thus, class areas are fixed in many
instances in the organization of administrative machinery for the
collection of the data. A customs office may be so set up that, in
the reporting of statistics of imports, details can be conveniently
reported for customs districts only, and not for individual ports.
This, of course, will make it impossible subsequently to analyze
the geographic distribution of imports by any smaller, or more
equal, geographic subdivisions than customs districts.

1 Areas nominally equal are sometimes not really so. Thus, several counties may be equal
in size, but one of them, because it contains a large body of water, may possess a much
smaller inhabitable area. Differences of this kind may sometimes call for adjustment in the
classes, or appropriate recognition in the interpretation of the data.

Class areas may be settled in other cases, not so much by the 
administrative organization of the agencies conducting the 
inquiry, as by the purposes for which the investigation is under- 
taken. Thus, in the decennial population census in this century, 
the original and fundamental purpose is political. The Federal 
constitution provides that once every ten years a census of the 
inhabitants shall be taken to determine the apportionment of 
representatives in Congress and of any direct taxes. To serve this 
purpose, returns must be organized on a definite political basis, 
i.e. the number of inhabitants must be reported for each political 
subdivision. As a natural consequence of this purpose, the 
population census is set up to yield population figures by political
divisions: states; counties; congressional districts; townships;
municipalities; wards.

It is clear enough that class areas determined on the basis of 
such considerations as these may prove altogether unsatisfactory 
from the point of view of spatial analysis. The areas may bear no 
significant relationship to geographic differences in the variable. 
The boundaries of political subdivisions, for example, are statis- 
tically deficient, not only because they describe unequal areas, 
but also because they commonly cut athwart and subdivide 
irregularly the areas which, from a statistical point of view, are 
strictly homogeneous; or include in the same division areas 
representing, from a statistical point of view, widely diversified
conditions. The difficulties along this line are greatly reduced, 
of course, when political boundaries are laid out on some regular 
scheme, as are the counties of the middle western states. At 
best, however, the deficiencies of political and other administra- 
tive divisions from the point of view of spatial analysis are serious. 

Only one effective remedy for these difficulties offers at present.
Decided improvement may be had at times by encouraging the
reporting of original items by smaller subdivisions. Social 
workers in some of our larger cities have succeeded in having 
certain statistical material reported by individual city blocks. 
This can be done at any time when the authorities are willing to 
cover the additional expense involved in the reporting and 
tabulating of cases for smaller areas. Undoubtedly, much will 
be accomplished in time by so reorganizing statistical investiga- 
tions as to yield data for such smaller equal areas. 

At the same time, some of the difficulties along this line are 
bound to continue. They are to be met not so much by attempts 
to obtain a reorganization of the methods of collection as by 
adjustments in the subsequent analysis of the material. Through 
the reduction of the original items to relatives of one kind or 
another, and through the adoption of appropriate means of 
graphic representation, the difficulties encountered in unequal 
class areas may be in some measure offset. It should be recog- 
nized, however, that the technique of analyzing and presenting 
the spatial variation of business and economic phenomena will 
always be beset by difficulties arising from the inequality of class 
intervals.1
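
One familiar relative of the kind referred to is density: the number of items per unit of area. The sketch below is offered only as an illustration; the function name density and the county figures are hypothetical.

```python
def density(counts, areas):
    """Return the number of items per unit of area for each class area."""
    if len(counts) != len(areas):
        raise ValueError("one area is needed for each count")
    if any(a <= 0 for a in areas):
        raise ValueError("areas must be positive")
    return [c / a for c, a in zip(counts, areas)]


# Hypothetical counties: population, and area in square miles.
population = [12000, 12000, 45000]
square_miles = [400, 1200, 900]
per_square_mile = density(population, square_miles)
```

The first two counties report the same population, but the first is three times as densely settled as the second; the relative brings out a difference which the original items conceal.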

Aside from the difficulty of setting up equal class intervals in
spatial distributions, seriation of the variable involves no serious
problems. The limits of the several classes and the designations
to be applied will ordinarily be clearly prescribed by the character
of the areas for which the data are given.2

Effective presentation of spatial distributions requires special
consideration. It needs only a moment’s thought to bring out
the fact that any columnar arrangement of items leaves much to
be desired. Columns run along a line; most spatial
relationships spread out over an area. Tabular and textual presentation
of spatial distributions, except those which are linear, is
defective on this account. Nevertheless, tables have to be used in
presenting spatial relationships, and it is important to see what
can be accomplished in tabular form.

1 In connection with the subject of class areas, it should be noted that it is quite possible
to make the classes cumulative. The determination of the center of population in the United
States involves the use of such cumulative class areas. In general, however, the principle
of cumulation is not often applied to spatial distributions.

2 If the distribution relates to areas not generally familiar, it may be desirable to
construct a key to the class designations. A simple scheme serving this purpose is commonly
employed on maps in geographies. By lines drawn at right angles the map is first divided
into squares or rectangles. The vertical subdivisions are then designated by letters and the
horizontal, by numbers. Finally, an alphabetical list of the places to be located is provided,
each place being given the letter-number corresponding to its position on the map. Some
such key may prove necessary in the effective use of a statistical map.

In the first place, if the relationships happen to be exactly or 
approximately rectilinear, the arrangements of items in the table 
may be made to correspond. ‘This was done in the construction 
of Table 26B, which is reasonably successful in bringing out the 
spatial relationships of population in the different Kansas 
counties. Unfortunately, this relatively simple geographic form 
is not of common occurrence. 

CHART 13. POPULATION OF THE UNITED STATES, JANUARY 1, 1920, BY STATES
(in millions)

In the second place, the development and common recognition
of standard arrangements of items may permit a rough representation
of areal relationships in a column of figures. Thus,
there is the practice long followed in the work of the United States
Bureau of the Census of listing the states in the order of their
locations from north to south and from east to west. This
standard arrangement is shown in the following list:


The North
    New England: Maine, New Hampshire, Vermont, Massachusetts,
        Rhode Island, Connecticut
    Middle Atlantic: New York, New Jersey, Pennsylvania
    East North Central: Ohio, Indiana, Illinois, Michigan, Wisconsin
    West North Central: Minnesota, Iowa, Missouri, North Dakota,
        South Dakota, Nebraska, Kansas

The South
    South Atlantic: Delaware, Maryland, District of Columbia, Virginia,
        West Virginia, North Carolina, South Carolina, Georgia, Florida
    East South Central: Kentucky, Tennessee, Alabama, Mississippi
    West South Central: Arkansas, Louisiana, Oklahoma, Texas

The West
    Mountain: Montana, Idaho, Wyoming, Colorado, New Mexico,
        Arizona, Utah, Nevada
    Pacific: Washington, Oregon, California


If such a standard arrangement is generally recognized, or if the 
geographic connotations of the different designations in some way 
can be made self-evident, a tabular presentation may serve to 
bring out in some measure the spatial relationships of the several 
classes. 

No regular tabular presentation of spatial distributions, how- 
ever, is really adequate. Some sort of graphic exhibit is neces- 
sary to convey an accurate picture of the situation. The simplest 
possible form of such an exhibit would seem to be an outline map 
in which the numerical data have been entered. Chart 13 
affords an illustration. 
Such an exhibit, rudimentary as it is, has the obvious merit
of presenting the items in their proper relationship to one 
another. 

CHART 14. NUMBER OF HORSES ON FARMS, APRIL 15, 1910, BY STATES

No form of presentation of spatial distributions, however,
begins to be as effective as the statistical map. In order to
understand the real significance of an item in a spatial distribution,
the items on all sides must be carefully examined. Relationships
must be read over an area instead of along a line. To
accomplish this purpose, no textual or tabular exposition is
adequate. There is a fundamental reason for the extensive use
of maps in the subject of geography,1 and the same reason holds
for map presentation of spatial variations in economic and business
phenomena.

1 In fact, it may be argued that spatial distributions by nature belong to geographic
science. Certainly much is to be said for the general contention that geographic science
should be extended to include all aspects of spatial variation.

Spatial distributions are best represented graphically by
multiple dot maps. Such maps are constructed on the principle
of allowing each dot to represent a stated number of cases and
distributing over the appropriate areas on the map the number
of dots necessary to represent the cases in each division.1 Charts
14 and 15 are examples.
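
The principle of construction reduces to a simple division, which may be sketched in Python. The function name, the areas, and the counts below are hypothetical; they are chosen only for illustration and are not taken from Charts 14 and 15.

```python
def dots_needed(cases, cases_per_dot):
    """Number of dots to plot for one area, rounded to the nearest whole dot."""
    if cases_per_dot <= 0:
        raise ValueError("cases_per_dot must be positive")
    return round(cases / cases_per_dot)


# Hypothetical counts of cases by area (illustrative figures only).
cases_by_area = {"Area A": 1_250_000, "Area B": 310_000, "Area C": 42_000}

# Each dot is made to stand for 5,000 cases (an assumed value).
dots_by_area = {area: dots_needed(n, 5_000) for area, n in cases_by_area.items()}
print(dots_by_area)
```

The smaller the number of cases assigned to each dot, the more dots must be plotted and the finer the resulting picture.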

CHART 15. NUMBER OF SHEEP ON FARMS, APRIL 15, 1910, BY STATES
Source: Statistical Atlas of the United States, 1914.

The multiple dot map is an exceptionally effective form of
graphic representation. It shows the space relationships as
these can only be exhibited in a map, and in addition, has the
virtue of employing areas which correspond to the sizes of the
classes on which the distribution has been set up. Unequal class 
areas in the distribution are matched exactly by unequal areas in 
the map. The multiple dot map thus makes an adjustment 
which eradicates much of the confusion incident to the unequal 
class intervals which commonly characterize spatial distributions. 

1 Maps of this description are not to be confused with single dot or circle maps. In the
latter, solid black circles, whose areas correspond to the number of items, are placed in the
several different areas recognized in the distribution. Maps of this sort, though rather
effective from a pictorial point of view, are unsatisfactory because of the impossibility of
reading accurately the areas of circles. Furthermore, this graphic form does not take directly
into account the varying size of the areas upon which the circles are plotted or placed.
Some adaptation of the graphic form to the nature of the distribution is of great importance
in the case of spatial distributions, which almost invariably show unequal class areas.

Two varieties of multiple dot maps are to be recognized. In
the first, each dot represents a large number of items, and a
relatively small number of dots are necessary to represent the
total number of cases in any individual class. This type of map
has been commonly employed by the Bureau of the Census, and 
is illustrated in Chart 14. The only technical difficulty in mak- 
ing this type of map arises from the necessity of obtaining an 
even distribution of the dots used to represent the cases in each 
division. On areas of irregular shape, an even spacing of 
the dots is not always easily accomplished. In general, 
however, construction of this form of map presents no material 
difficulties. It is a reasonably satisfactory form, though its 
effectiveness falls conspicuously short of the second type of mul- 
tiple dot map. 

In this second type, each dot is made to represent a smaller 
proportion of the total number of cases, with the result that a 
large number of the dots have to be plotted.1 The result is a
much more accurate picture of the actual distribution of the 
cases. Charts 16 and 17 serve to illustrate the form. Maps 
of this description are used increasingly and are exceedingly 
effective in showing spatial distributions. 

Pin-maps are essentially dot maps so constructed that the dots 
may be removed or relocated. They are constructed by sticking 
specially prepared pins, made usually with short shanks and
round glass heads of different colors, into outline maps mounted
on thick cardboard or cork.2 Such pin-maps are commonly employed
when the exact location of the individual items is known.
The pins are then placed as nearly as possible on the exact loca- 
tions. Pin-maps have been used very effectively in tracing the 
incidence of epidemics and may be used profitably in following 
the location of cases of all sorts when it is desirable to trace the 
changing situation continuously, rather than merely to catch its 
image at a single moment of time.

1 The practice is ordinarily to make the map on a large scale and subsequently to reduce
it in the reproduction.

2 The technique of constructing these maps is excellently explained in Willard C. Brinton’s
Graphic Methods for Presenting Facts, chap. XII.

Spatial distributions are not as common in economic and
business analyses as either frequency or temporal distributions.
In general, there are two reasons for this. In the first place, a
great many of the most significant spatial distributions have 
been appropriated by the science of geography and have consequently
not been so definitely studied in economic investigations.1
In the second place, spatial distributions have not been
so extensively developed in economic research because, as we 
have seen, there are serious difficulties in obtaining the data in 
satisfactory form. Proper seriation of the spatial variable 
encounters more serious obstacles than seriation of either of the 
other fundamental varieties of variable. Spatial distributions 
are important, nevertheless, and should be more carefully studied 
than has been the custom. 


1 The development of economic geography promises in time to develop the economic 
and industrial aspects of many of these spatial distributions. 


CHAPTER VIII 
TEMPORAL DISTRIBUTIONS


The distributions considered thus far have fallen into two
fundamental classes: (1) those unrelated to either time or space ; 
and (2) those related specifically to space. A third type has now 
to be examined; namely, the type in which the distribution is 
definitely related to time. Distributions of this sort abound in 
the records of social and business science. They represent some 
of the most interesting material available for statistical analysis. 
They may be considered under the two heads of the (1) simple, 
and (2) cumulative, forms. 


A. SIMPLE FORM


The general character of simple temporal distributions is easily 
understood. It will be recalled that original observations note 
the existence, or the variable size, of attributes or subjects falling 
within specified limits of time and space. But observations may 
be made at different times. Furthermore, if the differences of 
time are significant and are carefully recorded, they may be 
made the primary basis of analysis. A temporal distribution 
of the variable then results. Such a distribution subdivides an 
aggregate relating to a longer period of time into smaller aggre- 
gates pertaining to shorter periods within the longer. Thus, the 
production of pig iron in 1923, instead of being given as a total 
for the year, may be distributed in the form of twelve items ex- 
pressing the production of pig iron month by month through the 
year. The character of such a distribution is self-evident. 

The processes involved in the formulation of a simple temporal 
distribution are essentially those already recognized in the cases 
of attributive and spatial distributions. These processes have 
been explained at length in the preceding chapters. It may safely
be stated, however, that fewer difficulties are encountered in 
setting up the temporal distribution than in formulating either 
of the other two types. Why this is so will appear if the steps 
that have to be taken in formulating the temporal distribution 
are considered one by one. 

Selection of the interval in a temporal distribution is relatively 
simple. Divisions of time which are natural, and for the most 
part regular, have been recognized in the original observations. 
These original observations may have been made periodically 
as of specified dates or stated intervals; or they may have been 
made continuously with specific reference to the days, weeks, 
months, or years. The organization of the inquiry ordinarily 
determines the intervals of time which may be used subsequently 
in the analysis. But the important point is that the intervals 
commonly employed in reporting the variable are satisfactory 
from the point of view of analysis. Essentially equal intervals 
are taken from the calendar as a matter of course. The decade, 
the year, the month, the week, the day, the hour — these and 
multiples of these are the customary divisions of time in temporal 
distributions. 

Determination of the most satisfactory interval, where conditions
permit of a choice, is by no means as simple. Sometimes
the nature of the variable settles the matter once for all.
Thus, data on agricultural production naturally conform to the 
crop period. This is ordinarily the year. The farmer makes his 
commitments in sowing his crop. Once the crop is under way 
the farmer is almost certain to see it through to the harvest. 
Agricultural results are to be thought of largely in terms of the 
crop-year. While it is important to know what the crop of each 
year is, there is ordinarily little interest in further details as to 
the precise period of its appearance. The most significant 
interval for the time series in this instance is set by the nature 
of the variable.1


1 Mineral or manufacturing production, in contrast to agricultural, may vary widely 
from one month to another. It may be important, therefore, to know the output of the 
mines and of the factories by months. 
More commonly, the question of the most satisfactory interval
is determined by the objectives of inquiry; it is a question of 
what measure of detail gives most significant results. In some 
analyses, annual data afford all the information required. Occa- 
sionally, even longer intervals are satisfactory. Thus, a record 
of gold production by quinquennial periods may suffice in an 
investigation of long-time relationships between prices and the 
quantity of gold produced. The practice is growing, however, 
of giving time data on the basis of shorter intervals. Some of 
the most interesting types of fluctuation among social and 
business phenomena are almost entirely concealed in annual 
totals. Quarterly, monthly, and weekly records serve much 
better. These more detailed records have the merit, for instance, 
of disclosing seasonal variations. These and other shorter- 
time movements may possess marked significance. Series of 
daily items are not commonly desirable: they tend to be more 
erratic and are unnecessarily detailed. Yearly aggregates often 
err in the opposite direction. For many analyses, monthly items 
seem to effect a satisfactory compromise. They exhibit clearly 
and without excessive detail the more important movements 
shown by time series. 

The problem of unequal intervals — so troublesome in the 
cases of attributive and spatial distributions— though not nearly 
as serious in the formation of time series, is not entirely absent. 
The months of the years are obviously of variable length, even 
when the calendar divisions alone are taken. When the variable 
number of Sundays is considered, one month for business purposes
may be much shorter or longer than another. Thus, February 
contains four Sundays. Ignoring holidays and leap years, this 
leaves but twenty-four business days in that month. March 
may include only four Sundays, in which case it will show 
twenty-seven business days, or 12.5 per cent more than February. 
Similarly, all days may be twenty-four hours in length; yet from 
the point of view of the hours of employment, they may vary 
greatly in the same plant. Working weeks may vary from forty 
to sixty hours. The apparently equal census decades shown in a 
preceding table are not of exactly equal length, for the count in
earlier censuses was taken on July 1, in 1910 on April 15, and in
1920 on January 1. Such differences, slight as they appear, may
cause substantial differences in the variable; they confuse the
series in which they occur.1 In general, differences in the real
interval in time series are to be carefully examined. If they prove 
to be of serious import, they are to be eliminated, or in some 
way allowed for, in the analysis and interpretation of the 
data. 
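
The count of business days can be checked mechanically. The following is a minimal sketch in Python, assuming (as the text does) that only Sundays are excluded and holidays are ignored. The choice of the year 1921 and the function name are assumptions made for illustration.

```python
import calendar
from datetime import date


def business_days(year, month):
    """Days in the month minus Sundays; holidays are ignored."""
    days_in_month = calendar.monthrange(year, month)[1]
    sundays = sum(
        1
        for day in range(1, days_in_month + 1)
        if date(year, month, day).weekday() == 6  # Monday is 0, Sunday is 6
    )
    return days_in_month - sundays


february = business_days(1921, 2)
march = business_days(1921, 3)

# Excess of March over February, as a percentage of February.
excess_pct = (march - february) / february * 100
print(february, march, excess_pct)
```

For 1921 this gives twenty-four business days in February and twenty-seven in March, an excess of 12.5 per cent, in agreement with the figures cited above.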

An illustration of how this can be accomplished may prove 
instructive. Take, for example, the record of monthly pro- 
duction of steel ingots in the United States from January, 1914, 
to December, 1921. The fluctuations of this series reflect for the 
most part real changes of industrial pace. But in some measure 
the fluctuations are due to the fact that the data are based upon 
months of varying length. Fluctuations on account of this cause 
are merely nominal from most points of view. They may be 
eliminated by translating the series into one showing average 
daily steel-ingot production by months. The two series are given 
in Table 28, together with the number of working days in each 
month. Such translation of time series is frequently desirable 
when the original intervals prove upon examination to be sub- 
stantially unequal. 
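
The translation is a matter of simple division. A minimal sketch in Python follows, using three rows of Table 28 as printed; the function and variable names are hypothetical.

```python
# (month, total output in gross tons, working days), as printed in Table 28.
rows = [
    ("January 1921", 2_517_048, 26),
    ("February 1921", 1_998_704, 24),
    ("March 1921", 1_794_777, 27),
]


def average_daily_output(total, working_days):
    """Average output per working day, rounded to the nearest gross ton."""
    return round(total / working_days)


def pct_change(old, new):
    """Percentage change from old to new."""
    return (new - old) / old * 100


daily = {month: average_daily_output(total, days) for month, total, days in rows}

# February to March, measured on monthly totals and on daily averages.
change_in_total = pct_change(rows[1][1], rows[2][1])
change_in_daily = pct_change(rows[1][1] / rows[1][2], rows[2][1] / rows[2][2])
print(daily, change_in_total, change_in_daily)
```

From February to March, 1921, the monthly total fell by about 10 per cent, while the daily average fell by about 20 per cent. The longer month concealed roughly half of the real decline in industrial pace.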

Class limits in time series are ordinarily self-evident. Any 
particular year, or month, or day is objectively and rigidly 
defined. In the use of special periods, however, the class limits 
are not always as clear. Fiscal years may begin and end at any 
time in the calendar year. The “crop-year” for cotton contains
one run of twelve months; the “clip-year” for wool quite 
another. The United States Census period may vary in its limits 
from time to time and from purpose to purpose. In general, the 
class limits are to be specifically indicated unless they conform 
to the calendar, which is to be assumed in the absence of any 
statement to the contrary. 


1 For interesting illustrations, see the returns on live stock in Thirteenth Census of the 
United States, 
TABLE 28. MONTHLY STEEL INGOT OUTPUT IN THE UNITED STATES,
JANUARY, 1920, TO JUNE, 1922 (in gross tons)

MONTH          TOTAL OUTPUT    NO. OF WORKING    AVERAGE OUTPUT
               FOR MONTH       DAYS IN MONTH     PER DAY

1920
January        3,524,025       27                130,519
February       3,401,759       24                141,740
March          3,910,958       27                145,073
April          3,132,409       26                120,480
May            3,423,178       26                131,061
June           3,538,970       26                136,114
July           3,327,783       26                127,992
August         3,502,410       26                137,016
September      3,561,304       26                136,976
October        3,580,873       26                137,720
November       3,132,890       26                120,496
December       2,778,713       26                100,874

1921
January        2,517,048       26                 96,810
February       1,998,704       24                 83,279
March          1,794,777       27                 66,473
April          1,386,897       26                 53,342
May            1,446,181       26                 55,022
June           1,146,349       26                 44,090
July             917,823       25                 36,713
August         1,300,199       27                 48,156
September      1,342,092       26                 51,619
October        1,847,138       26                 71,044
November       1,896,482       26                 72,042
December       1,630,394       26                 62,707

1922
January        1,820,487       26                 70,019
February       1,993,016       24                 83,067
March          2,708,484       27                100,314
April          2,792,755       25                111,710
May            3,097,366       27                114,717
June           3,009,800       26                115,761


Source: Iron Age, July 13, 1922, p. 72. 


Class designations in time series usually involve no serious 
questions. All that need be indicated is that the items refer to 
specified days, or months, or years, or other obvious divisions of 
time. This is true whether the series be in simple or cumulative
form.

Graphic representation of temporal distributions conforms to
definitely recognized standard practices. The diagrams are 
constructed on rectangular axes with the independent variable — 
time — scaled along the horizontal or x-axis, and the dependent 
variable along the vertical or y-axis. The vertical scale is run 
from bottom to top, the horizontal scale from left to right. 
The general scheme of construction is illustrated in Chart 18. 


CHART 18. MOVEMENTS OF FEDERAL RESERVE NOTES AND DEPOSITS OF
FEDERAL RESERVE BANKS, 1917-1923

1 The fact that time is laid off so that it runs from left to right constitutes one of the
reasons why the Census practice of putting later years in the left-hand side of a table is
somewhat unfortunate. The Census arrangement reverses our long-established notions of
the position of items in time. The Census does not arrange its diagrams similarly. There
would seem to be a strong presumption in favor of adopting the same underlying scheme for
corresponding tables and diagrams.

Both line and vertical bar forms are to be regarded as consistent
with standard practice, provided they conform to the
principles just stated.1 As a matter of fact, the bar diagram
may be regarded as essentially analogous to the corresponding 
line form. Certainly there seems to be no logical ground for 
recommending one form to the exclusion of the other. Some- 
times the two may be effectively combined as in Chart 19. 
Perhaps in the representation of components the bar form is 
somewhat more satisfactory than the line chart. In general,
however, the line diagram is somewhat easier to construct, is
thoroughly satisfactory in appearance, is generally recognized and
easily read, and may be regarded as the type of diagram to employ
unless there are reasons for adopting some other.

CHART 19. GOLD MOVEMENTS AND FEDERAL RESERVE BANK RESERVES,
1919-1923
(in millions of dollars)
Bars above base line represent net imports; bars below base line, net exports.

Source: Tenth Annual Report of the Federal Reserve Board, p. 17.


1 Horizontal bar diagrams for time series are in doubtful standing. They are, however, 
sometimes employed by reputable offices.

Whether line or bar form is adopted, one general rule governing
the scale of the dependent variable is to be carefully recognized : 
there is a strong presumption in favor of locating the zero of the 
vertical scale at the foot of the vertical axis, and there should be 
clear and convincing reasons for the adoption of any arrangement 
which does not place the zero of the vertical scale in this position. 
The reasons for this rule are not difficult to find. In reading 
any plot of a time series we unconsciously assume that varying 
distances above the horizontal axis are proportionate to dif- 
ferences in the magnitude of the variable. This assumption is 
warranted only if the scale of the variable included in the chart 
runs down to zero. While there are circumstances under which 
it is unnecessary to include the zero of the vertical scale in the 
diagram, these circumstances are somewhat exceptional.1 In
general, particularly when series are being compared, it is of the
utmost importance, in the interest of accurate graphic representation,
that the zero of the vertical scale should appear at the
foot of the vertical axis. 
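
The force of this rule may be seen in a small calculation. The sketch below, in Python, compares the ratio of two plotted heights with the true ratio of the magnitudes when the vertical scale does and does not run down to zero. The two values and the position of the axis are hypothetical.

```python
def apparent_ratio(first, second, axis_floor=0.0):
    """Ratio of plotted heights above the foot of the vertical axis."""
    if first <= axis_floor:
        raise ValueError("first value must lie above the foot of the axis")
    return (second - axis_floor) / (first - axis_floor)


# Hypothetical items of a series: 100 in one period, 110 in the next.
with_zero = apparent_ratio(100, 110)           # scale runs down to zero
without_zero = apparent_ratio(100, 110, 90.0)  # scale begins at 90
print(with_zero, without_zero)
```

With the zero included, the second plot stands one tenth higher than the first, as the data require. With the scale cut off at 90, the second plot stands twice as high as the first, though the variable has increased by only 10 per cent.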

1 Save when the logarithmic scale is employed, a case which will be dealt with at
length later.

In the location of the scale of time along the horizontal axis,
two distinct practices are followed. One practice plots points
on the vertical lines of the grid; the other plots points between
these vertical lines, in other words, over the subdivisions
marked off by the vertical lines. Neither practice would seem to
have the preponderance of authority. The first practice has the
virtue of convenience since, with the vertical lines serving as
guides, the plots can be spaced accurately without the slightest
difficulty. The second practice, on the other hand, is sometimes
explicitly recommended on the ground that it is desirable to
think of the scale of time along the horizontal axis as continuous,
and this idea is conveyed most clearly by regarding the subdivisions
of the horizontal axis as corresponding to definite periods
of time. Upon the whole, the arguments for plotting over the
intervals, rather than on the lines, would seem to prevail. As
will be shown later, in the plotting of some time series it is
important to think of the horizontal scale as continuous. It
somewhat simplifies the graphic practice, therefore, to regard the 
horizontal scale as continuous even when the character of the 
variable does not clearly require this conception. Unless there 
are definite reasons for not doing so, therefore, the items of time 
series may well be plotted in line charts over the middle of the 
subdivisions marked off by the vertical lines of the grid. 


CHART 20. VALUE OF ORDERS RECEIVED ANNUALLY BY GENERAL ELECTRIC
COMPANY, 1893-1922
Source: Thirty-year Review of the General Electric Company, 1892-1922;
A letter to stockholders of the General Electric Co., Schenectady, N. Y., July 16, 1923. 


In the line chart commonly used to represent the several aggre- 
gates appearing in a simple temporal distribution, lines are 
ordinarily drawn to connect the successive plots. It should be 
recognized that these connecting lines are inserted merely to 
guide the eye from one plot to another and to give the diagram 
as a whole visual effectiveness; they do not correspond to any 
clearly assignable intermediate values of the variable. This follows
from the fact that the total for the longer period for which
the variable is given is completely distributed in the several 
smaller aggregates actually cited in the distribution. The 
variable has no values in between those stated. 


B. CUMULATIVE FORM 


The temporal distributions considered thus far have all been 
of the simple type; the numbers reported for each interval have 
been exclusive of the numbers reported for all the other intervals. 
It is possible, however, in temporal — as in attributive — dis- 
tributions, to work with cumulative figures. 

The relationship between the cumulative and simple forms in 
the case of temporal distributions is like that already explained 
in the case of frequency distributions: each successive interval 
is inclusive of the intervals that have gone before. The relation- 
ship between the simple and cumulative forms may be made clear 
through an illustrative case. The two forms are presented for 
monthly Portland-cement production in 1923, in the table on 
the following page. 
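
The passage from one form to the other may be sketched in Python with the figures of Table 29 for April to December. The total carried in from the first three months is not taken from a printed row; it is inferred here by subtracting April production from the cumulative figure for April, and the variable names are hypothetical.

```python
from itertools import accumulate

# Monthly production, April to December, 1923, in thousands of barrels (Table 29).
monthly = [11359, 12910, 12382, 12620, 12967, 13109, 13350, 12603, 9997]

# Total through March, inferred from the April row (an assumption of this sketch).
carried_in = 37439 - 11359

# Simple to cumulative: a running total that starts from the amount carried in.
cumulative = list(accumulate(monthly, initial=carried_in))[1:]

# Cumulative to simple: the difference between successive cumulative items.
recovered = [later - earlier
             for earlier, later in zip([carried_in] + cumulative, cumulative)]
print(cumulative)
print(recovered == monthly)
```

Each cumulative item includes all that has gone before, and the simple items are recovered by taking successive differences.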

Cumulative temporal distributions differ from the corresponding
cumulative frequency form in that they permit of subtractions
as well as additions (decrements as well as increments)
from one period to another. An illustration will serve to make
the matter clear. Take the total net movement of gold into the 
United States from 1915 to 1923. This can be given in monthly 
items of net import and export from January, 1915, to December, 
1923. (See Chart 19.) The cumulative results of these are easily 
calculated by mere addition and subtraction of the monthly items. 
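
A minimal sketch in Python shows how such a cumulation behaves when the items carry signs. The monthly figures below are hypothetical and are not the gold movements of Chart 19; net imports are entered as positive and net exports as negative.

```python
from itertools import accumulate

# Hypothetical net monthly movements (positive: net import; negative: net export).
net_monthly = [10, -4, 7, -15]

# The cumulative series rises with imports and falls with exports.
cumulative_net = list(accumulate(net_monthly))
print(cumulative_net)
```

Unlike a cumulative frequency distribution, the cumulative series here may decline from one period to the next and may even pass below zero.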

1 This may be regarded as a reason for adopting the vertical bar form rather than the line
form for simple temporal distributions. An interesting illustration of the bar form is
afforded by the chart on the opposite page.

Practically no difficulties arise in the formulation of such cumulative
temporal distributions if they are derived directly from
corresponding simple distributions which have been previously
set up in proper form. The only question that may arise in
making the translation from one form to the other relates to the
class designations. Care should be taken to specify exactly in
the cumulative distribution the period through which the
variable cumulates. Ordinarily this may be done satisfactorily
by using such a form as “Week ended January 18, 1921,” or
“Crop-year ended August 31, 1921.”1 Designations in this style
leave no doubt as to the period to which the items relate. In
general, reasonable care in formulating the designations will
make clear practically any cumulative temporal distribution.

TABLE 29. MONTHLY PRODUCTION OF PORTLAND CEMENT IN
THE UNITED STATES IN 1923
(in thousands of barrels)

                       PRODUCTION
MONTH          IN THE MONTH    CUMULATIVE

April              11359           37439
May                12910           50349
June               12382           62731
July               12620           75351
August             12967           88318
September          13109          101427
October            13350          114777
November           12603          127380
December            9997          137377

Source: “Statistical Record,” Review of Economic Statistics, July, 1924, prel. vol. VI,
suppl. 1, p. 132.

1 “Ended” is better than “ending.”

The graphic representation of cumulative temporal distributions
observes the governing principles already noted in the
discussion of simple distributions. Only one or two additional
points need be recognized. In the first place, it is obvious that
cumulative distributions may be regarded as essentially continuous;
that is, intermediate values may be assumed to exist in
the cumulation of the total, though cumulation is actually
recorded only at the end of stated intervals. Thus, if cumulative
coal production from the first of January, 1923, to the last of April
of that year is 181,700,000 short tons and to the last of May is 
227,780,000 short tons, it is not illogical to think of a figure half- 
way between these two as in some measure pertaining to the 
middle of May. On this ground there is reason for definitely 
preferring a line form rather than a vertical bar form of diagram 
in the representation of such distributions. 
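
The intermediate figure may be worked out as follows. This is a minimal sketch in Python which assumes that cumulation proceeds evenly through the month, an assumption the text advances only "in some measure"; the function name is hypothetical.

```python
def interpolate(start_total, end_total, fraction):
    """Cumulative total assumed at a given fraction of the way through an interval."""
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must lie between 0 and 1")
    return start_total + (end_total - start_total) * fraction


# Cumulative coal production in short tons, to the last of April and of May, 1923.
mid_may = interpolate(181_700_000, 227_780_000, 0.5)
print(mid_may)
```

The result, 204,740,000 short tons, is the figure that a connecting line on the chart would indicate for the middle of May.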


CHART 21. CUMULATIVE MONTHLY PRODUCTION OF PORTLAND CEMENT IN
THE UNITED STATES IN 1923
(vertical scale: million barrels; horizontal scale: months, Jan. to Dec.)


In the second place, in the construction of graphs to represent 
cumulative temporal distributions, care should be taken to relate 
the plots accurately to the horizontal scale. Cumulation, as 
stated over a given period, is to be thought of as effective only 
at the close of this period. Consequently the plots should be 
over those positions of the horizontal scale which correspond to 
the closing of the several intervals, not over the middle of the 
intervals as in the diagrams of simple distributions. 
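
The two rules of placement may be set side by side in a short sketch in Python. It assumes that the vertical lines of the grid stand at the whole numbers 0, 1, 2, and so on, each subdivision representing one interval of time; the function name is hypothetical.

```python
def plot_positions(n_intervals, cumulative=False):
    """Horizontal positions of the plots for a series of n_intervals items.

    Simple items are placed over the middle of each subdivision;
    cumulative items are placed over the close of each subdivision.
    """
    offset = 1.0 if cumulative else 0.5
    return [index + offset for index in range(n_intervals)]


simple_positions = plot_positions(3)
cumulative_positions = plot_positions(3, cumulative=True)
print(simple_positions, cumulative_positions)
```

For three intervals the simple items fall at 0.5, 1.5, and 2.5, and the cumulative items at 1, 2, and 3.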


With these matters in mind, as well as the general principles 
governing the graphic representation of time series, no difficulty 
should be experienced in giving satisfactory graphic expression 
to a cumulative temporal distribution. The general type of 
diagram is well represented in Chart 21. 

Cumulative temporal distributions are very effectively used 
in tracing the course of many business factors, particularly when 
it is possible to set up a definite schedule against which progress 
or accomplishment may be measured. In Table 30 we have an 
illustration of this use of the series. When thrown into graphic 
form — as in Chart 22 — such an analysis may be an extremely 
effective check on actual performance.1
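
The comparison of schedule and performance may be sketched in Python with four rows of Table 30. The variable names are hypothetical, and the percentage of schedule attained is an added measure that does not appear in the table.

```python
# (month, scheduled, actual), each cumulative to the end of the month (Table 30).
rows = [
    ("February", 100, 15),
    ("March", 200, 40),
    ("April", 400, 150),
    ("May", 600, 350),
]

# Amount by which actual cumulative production falls short of the schedule.
shortfalls = [scheduled - actual for _, scheduled, actual in rows]

# Actual production as a percentage of the schedule, to one decimal place.
attained_pct = [round(100 * actual / scheduled, 1) for _, scheduled, actual in rows]
print(shortfalls, attained_pct)
```

The shortfall grows from 85 to 250 over these months, yet the proportion of the schedule attained rises from 15 to about 58 per cent; both readings are of use in checking performance.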


CHART 22. CUMULATIVE SCHEDULED AND ACTUAL MONTHLY PRODUCTION
OF AIRCRAFT, 1918
(vertical scale: number; horizontal scale: months, Feb. to Oct.)


1 An alternative graphic form for comparing cumulative totals is the Gantt Chart.
This is widely used in business enterprises. For an explanation of the methods of constructing
Gantt Charts as well as of their many advantages, see Wallace Clark, The Gantt
Chart; also Walter N. Polakov, “Kinetic Statistics as an Aid to Production and Distribution,”
Journal of the American Statistical Association, Sept., 1922, pp. 359-365.
TABLE 30. CUMULATIVE SCHEDULED AND ACTUAL MONTHLY PRODUCTION
OF AIRCRAFT, 1918

AT END OF       SCHEDULED    ACTUAL    DIFFERENCE

January            ...         ...        ...
February           100          15         85
March              200          40        160
April              400         150        250
May                600         350        250
June               800         400        340
July              1000         705        295
August            1200         862        338
September         1400         940        460
October           1600        1125        475


Simple and cumulative temporal distributions are among the 
most interesting and important forms of statistical material. 
Since they show relationships between time and an observed 
dependent variable, they fall unmistakably within the category 
of statistical series. Together with other time series — many of 
which are not distributions — they will be carefully analyzed 
in a later section of this study. 


ANALYSIS OF FREQUENCY 
DISTRIBUTIONS 


CHAPTER IX 
TYPES OF FREQUENCY DISTRIBUTION 


If a large number of simple frequency distributions are examined,
a variety of forms will be discovered. These forms will be
distinguished by differences in the number and location of the 
points of concentration of items, and by the extent to which 
the items scatter. Again some distributions will show but one 
peak, others two or more. The maximum frequency will occur 
in some distributions at the middle of the range of variation; in 
others, toward one side or the other. The range of variation 
will run out indefinitely in some distributions; in others it will 
end abruptly at a given point. These differences raise the question:
from what points of view are frequency distributions to
be most significantly characterized, or classified?

The most important basis of classification relates to the position 
of the class of maximum frequency in the range of the variable. 
From this point of view, four distinct types are encountered:1

Symmetrical
Moderately asymmetrical
Extremely asymmetrical
“J-shaped”


The distinctions among these types rest upon differences of 
degree, but are nevertheless unmistakable. Their nature will 
be evident if concrete illustrations are considered. 

In the symmetrical or bell-shaped distribution, the point of 
maximum frequency is at the middle of the distribution, and the 


1 A fifth type, referred to as “U-shaped,” is rare. This type shows two points of
concentration, one at either end of the distribution. Yule (Introduction to the Theory of
Statistics, pp. 103-104) cites variation in the degree of cloudiness as an illustration. The type
is so exceptional as to require no further discussion here.
frequencies become lower and lower as the classes are farther
and farther removed from the center of the distribution. The 
type may be illustrated by data on the height of the members of 
a class of men just entering college. The distribution is shown 
numerically in Table 31 and graphically in Chart 23. 


FREQUENCY DISTRIBUTION OF HEIGHT AMONG MEMBERS OF FRESHMAN CLASS
IN 1913 IN OHIO STATE UNIVERSITY

TABLE 31

HEIGHT (in inches)    NUMBER OF MEN
All                        750
61                           2
62                          10
63                    [illegible]
64                          38
65                          57
66                          93
67                         106
68                         126
69                         109
70                          87
71                          75
[entries for the remaining classes are illegible]

CHART 23

[Chart: number of men (vertical axis) by height in inches (horizontal axis, 60 to 74).]

Source: C. J. West, Introduction to Mathematical Statistics, p. 24.


The highest frequency clearly is at 68 inches; about one-sixth 
of the total number of men are approximately of this height. 
On both sides of this largest group, the frequencies fall off, the 
rate of decline being about the same both above and below the 
point of greatest concentration. A distribution such as this is 
clearly symmetrical. 

Two other illustrations from quite different fields may prove 
instructive. A very close approximation to the perfectly sym- 
metrical distribution is shown by the records giving the per- 
centage of freight traffic to total traffic on American railroads, 
and by the records of the ratio of assessed to true valuation of 
property in 119 counties of Kentucky. The character of the 




distributions in these cases is shown in Tables 32 and 33 and 
Charts 24 and 25: 


FREQUENCY DISTRIBUTION OF PERCENTAGE OF FREIGHT TRAFFIC TO TOTAL
TRAFFIC ON 131 AMERICAN RAILROADS IN 1920

TABLE 32

MID-VALUE OF CLASS    NUMBER OF RAILROADS
All                        131
45                    [illegible]
50                           0
55                           3
60                           6
65                          12
70                          23
75                          31
80                          18
85                          18
90                          10
95                           6

Source: Moody's Manual, 1921.

CHART 24

[Chart: number of railroads (vertical axis) by percentage of freight traffic (horizontal axis, 40 to 100).]


FREQUENCY DISTRIBUTION OF RATIO OF ASSESSED TO TRUE VALUE OF PROPERTY
IN 119 COUNTIES OF KENTUCKY

TABLE 33

RATIO OF ASSESSED     NUMBER OF
TO TRUE VALUE         COUNTIES
All                      119
26-30                      1
31-35                      6
36-40                      7
41-45                     18
46-50                     19
51-55                     23
56-60                     21
61-65                     11
66-70                      3
71-75                      6
76-80                      4

CHART 25

[Chart: number of counties (vertical axis) by ratio of assessed to true value (horizontal axis, 30 to 85).]

Source: O. O. Jensen, Public Finance, p. 254.
The symmetrical distribution possesses great theoretical
significance. This is because it is exhibited by the returns ob- 
tained under simple conditions of chance. Suppose, for example, 
that an astronomer is making repeated observations on a fixed 
star. After making all corrections for the known disturbances 
of the observations, he will find his determinations varying some- 
what; in spite of his utmost precautions, they are all subject to 
minor errors. If he has exercised sufficient skill in his work, 
the errors will disclose two fundamental characteristics: (1) errors 
to one side will tend to be just as frequent as, but no more fre- 
quent than, errors to the other side; and (2) larger errors will 
tend to be less frequent than smaller errors; in fact, the larger
the error the less frequently it will tend to occur. These funda- 
mental conditions give rise to a symmetrical distribution of the 
actual observations; they fall symmetrically around the measure- 
ment which may be regarded as representing the real size or 
position of the object observed. 

The same sort of distribution may be obtained from a different 
type of source. Suppose six coins are tossed together a great 
number of times, and the number of heads appearing on each toss 
observed and recorded. In this way data may be obtained for a 
frequency distribution in which the independent variable is the 
number of heads appearing in the toss, and the dependent vari- 
able is the number of tosses showing a given number of heads. 
Results actually obtained from 1000 tosses of six coins appear in 
the following table : 


TABLE 34. FREQUENCY DISTRIBUTION OF NUMBER OF HEADS PER TOSS
IN 1000 TOSSES OF SIX COINS

NUMBER OF HEADS    NUMBER OF TOSSES SHOWING
IN TOSS            STATED NUMBER OF HEADS

[Table entries are illegible in this copy.]
In a distribution of this sort, it is perfectly obvious that six
heads, or no heads at all, will appear much less frequently than, 
say, two heads; the chances favor certain throws as against 
certain others. Though the variable is clearly discrete in char- 
acter, the cases are symmetrically distributed. The distribution 
assumes the same general form as in the case of the variable 
values obtained by repeated measurements of the same object. 

Empirical distributions of the sort developed by coin-tossing 
follow lines which are readily mapped out by purely theoretical 
calculations. Consider the expansion of the binomial $(p + q)^n$.
This is given by the expression

$$p^n + n p^{n-1} q + \frac{n(n-1)}{1 \cdot 2}\, p^{n-2} q^2 + \frac{n(n-1)(n-2)}{1 \cdot 2 \cdot 3}\, p^{n-3} q^3 + \cdots$$

The successive terms of this expression give the relative frequencies
of the classes in an ideal frequency distribution which is
approximated more or less closely by empirical results.

Take the case of coin-tossing just considered. The term p
is here the chance with each coin on each toss of obtaining a head.
This is, of course, one-half. The chance of failure to secure a
head, namely q, is likewise one-half. The theoretical distribution
of the relative frequencies in this case is to be found,
therefore, in the expansion of the binomial $(\tfrac{1}{2} + \tfrac{1}{2})^6$. The
absolute frequencies are to be found in N times this expression,
or in this case:

$$1000\left(\tfrac{1}{64} + \tfrac{6}{64} + \tfrac{15}{64} + \tfrac{20}{64} + \tfrac{15}{64} + \tfrac{6}{64} + \tfrac{1}{64}\right)$$

The theoretical distribution is, therefore, as follows:

16 + 94 + 234 + 312 + 234 + 94 + 16


The correspondence of empirical and theoretical results is mani- 
festly close. 
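The theoretical frequencies for the six-coin case can be checked with a few lines of Python. This is only a sketch: the function name `binomial_frequencies` is invented for the illustration, and whole-number rounding is assumed to be what the printed figures reflect.

```python
from math import comb

def binomial_frequencies(n, p, total):
    """Terms of total * (q + p)^n, indexed by the number of successes k."""
    q = 1 - p
    return [total * comb(n, k) * p**k * q**(n - k) for k in range(n + 1)]

# Six coins, chance of a head one-half, 1000 tosses (the case in the text).
theoretical = binomial_frequencies(6, 0.5, 1000)

# Python's round() sends the exact half 312.5 to the even neighbour 312,
# so the rounded terms still add up to 1000.
rounded = [round(f) for f in theoretical]
print(rounded)
```

The unrounded terms are 15.625, 93.75, 234.375, 312.5 and their mirror images; the symmetry follows from p and q being equal.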

Theoretically, the case of distributions given by chance may 
be generalized still further by assuming an infinitely large number 
of items and an infinitely small gradation of the independent 
variable. Under these conditions, the distribution assumes the 
form of a smooth curve, which is known as the normal frequency, 
or probability, curve (the “curve of error”).1 An illustration of
this curve is given in Chart 26.2


CHART 26. FORM OF THE NORMAL FREQUENCY CURVE

[Chart: bell-shaped curve with the maximum ordinate at the center; horizontal axis in standard deviations, from 5.0 below to 5.0 above the mean.]


Study of the characteristics of this normal distribution con- 
stitutes an important part of more advanced statistical analysis. 


1 Sometimes referred to as the Gaussian, or Laplace-Gaussian, curve, after the
mathematicians who first gave it study.

2 The equation of the curve is as follows:

$$y = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{x^2}{2\sigma^2}}$$

where e and π are the fundamental mathematical constants equal to 2.7183 and 3.1416
respectively, σ is the standard deviation of the independent variable X, x is the deviation
of the independent variable from its mean value, and y is the ordinate expressed in fractional
parts of the maximum ordinate. When x is zero (in other words, when we are dealing with
the mean of the independent variable) $y_0$, the frequency, is given by the expression

$$y_0 = \frac{N}{\sigma\sqrt{2\pi}}$$

where N is the total number of cases in the distribution. This value of y is the maximum
ordinate of the distribution. The other ordinates, or values of y, given by the equation
are fractional parts of this maximum ordinate.

The curve is obviously symmetrical. This follows from the fact that x appears only as a
square; the value of y is the same, therefore, whether a given value of x is positive or
negative. It is clear that there is no limit upon the value of x in either direction; the curve
approaches the X axis as an asymptote in both directions. The relative flatness of the curve
is determined by σ: the greater the value of σ, the flatter the curve.
It is not necessary, however, in this introductory study to undertake
any detailed examination of the curve. It is sufficient merely
to note its general characteristics and to recognize it as one dis- 
tinct type of frequency distribution. 
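The general characteristics of the normal curve (symmetry, and flattening as the standard deviation grows) can be seen numerically. The sketch below assumes the usual form of the curve, with ordinates stated as fractions of the maximum ordinate and a class interval of one unit; the function names are invented for the illustration.

```python
from math import exp, pi, sqrt

def relative_ordinate(x, sigma):
    """Ordinate at deviation x, as a fraction of the maximum ordinate."""
    return exp(-x**2 / (2 * sigma**2))

def maximum_ordinate(n_cases, sigma):
    """Frequency at the mean for n_cases items (unit class interval assumed)."""
    return n_cases / (sigma * sqrt(2 * pi))

# Symmetry: equal deviations on either side give equal ordinates.
left = relative_ordinate(-1.5, 1.0)
right = relative_ordinate(1.5, 1.0)

# Flatness: at the same deviation, a larger sigma keeps the ordinate higher.
narrow = relative_ordinate(2.0, 1.0)
flat = relative_ordinate(2.0, 3.0)

# Six coins tossed 1000 times: sigma = sqrt(n * p * q) = sqrt(1.5).
peak = maximum_ordinate(1000, sqrt(1.5))
print(left, right, narrow, flat, peak)
```

For the coin case the smooth curve puts its peak near 326 tosses, against 312 from the binomial itself; with only six coins the agreement is rough, and it tightens as the number of coins increases.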

Though the symmetrical distribution is of great theoretic 
importance and is commonly encountered in some fields — 
notably biology and anthropology — it is not of frequent occur- 
rence in economic and business research. Far more common 
in business and economic material are asymmetrical distributions.
In these the point of maximum frequency lies off the center of 
the distribution. The distribution is distorted or “skewed.” 


TABLE 35. TERMS OF THE BINOMIAL SERIES 10,000 $(p + q)^{20}$ FOR VALUES
OF p FROM 0.5 TO 0.1 (Figures given to the nearest unit)

NUMBER OF    p = 0.5   p = 0.4   p = 0.3   p = 0.2   p = 0.1
SUCCESSES    q = 0.5   q = 0.6   q = 0.7   q = 0.8   q = 0.9
               (a)       (b)       (c)       (d)       (e)
    0                                 8       115      1216
    1                      5         68       576      2702
    2            2        31        278      1369      2852
    3           11       123        716      2054      1901
    4           46       350       1304      2182       898
    5          148       746       1789      1746       319
    6          370      1244       1916      1091        89
    7          739      1659       1643       545        20
    8         1201      1797       1144       222         4
    9         1602      1597        654        74         1
   10         1762      1171        308        20
   11         1602       710        120         5
   12         1201       355         39         1
   13          739       146         10
   14          370        49          2
   15          148        13
   16           46         3
   17           11
   18            2
   19
   20

After Yule, Introduction to the Theory of Statistics, p. 294.


The relationship between the symmetrical and asymmetrical
may be indicated by referring again to the expansion of the
binomial $(p + q)^n$. In the case considered above, p was taken
equal to q. But suppose p and q are unequal. The distributions
then become asymmetric, asymmetry increasing as p and q
become more and more unequal. The conditions are indicated in
Table 35. In successive columns from left to right p and q
are given values which gradually depart from the conditions of 
equality shown in column (a). The corresponding curves ap- 
pear in Chart 27. 


CHART 27. VARYING DEGREES OF ASYMMETRY IN FREQUENCY DISTRIBUTIONS

[Chart: hundreds of cases (vertical axis) by number of successes (horizontal axis, 0 to 20).]


The condition of asymmetry thus has a definite theoretical relationship
to the condition of perfect symmetry previously considered.
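The drift from symmetry to asymmetry as p and q diverge can be reproduced by computing the terms of 10,000 times the binomial expansion for n = 20. The following is a sketch only; `binomial_terms` is an invented helper, and rounding to the nearest unit is assumed to match the printed table.

```python
from math import comb

def binomial_terms(n, p, total=10_000):
    """Terms of total * (q + p)^n, rounded to the nearest unit."""
    q = 1 - p
    return [round(total * comb(n, k) * p**k * q**(n - k)) for k in range(n + 1)]

# One column per value of p, as in Table 35.
columns = {p: binomial_terms(20, p) for p in (0.5, 0.4, 0.3, 0.2, 0.1)}

# Position of the maximum frequency in each column.
modes = {p: max(range(21), key=terms.__getitem__) for p, terms in columns.items()}
print(modes)
```

With p = 0.5 the maximum sits at 10 successes, the middle of the range; as p falls to 0.1 the maximum moves down to 2 successes while the range still runs to 20, which is the skewness shown in Chart 27.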
Asymmetrical distributions, as already indicated, are customarily
classified as moderately asymmetrical, extremely asymmetrical,
and J-shaped. The moderate and extreme types are
fairly common; the J-shaped much less so. The characteristics 
of each may be concretely illustrated. 

Any number of examples of the moderately asymmetrical dis- 
tribution might be cited; a single one will suffice. In Table 36 
and Chart 28, male weavers in selected cotton mills in the United 
States are classified according to their weekly wage-rates. 


FREQUENCY DISTRIBUTION OF WEEKLY WAGE-RATES AMONG MALE WEAVERS
IN SELECTED COTTON MILLS IN THE UNITED STATES

TABLE 36

WEEKLY             NUMBER OF
WAGE-RATES         WORKERS
$ 2.00-21.99         2766

$ 2.00- 2.99           20
  3.00- 3.99           34
  4.00- 4.99           40
  5.00- 5.99           47
  6.00- 6.99           52
  7.00- 7.99           76
  8.00- 8.99           83
  9.00- 9.99           99
 10.00-10.99          100
 11.00-11.99          110
 12.00-12.99          133
 13.00-13.99          215
 14.00-14.99          490
 15.00-15.99          379
 16.00-16.99          328
 17.00-17.99          280
 18.00-18.99          156
 19.00-19.99           81
 20.00-20.99           34
 21.00-21.99            9

CHART 28

[Chart: number of workers (vertical axis, 0 to 600) by weekly wage-rate in dollars (horizontal axis, 2 to 22).]


If this chart is closely examined, it will be found that the maxi- 
mum frequency is clearly to one side — in this instance, the upper 
side — of the middle of the range. The distribution lacks 
symmetry. Yet it is not far from symmetrical in general form.
In other words, it is moderately asymmetrical. In the material
of economics and business —in fact, throughout the social 
sciences — this type of frequency distribution is more commonly 
encountered than any other. 
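The position of the maximum frequency relative to the middle of the range can be located mechanically. The sketch below uses the weavers' frequencies as read from Table 36 (a few figures are restored from damaged print, so they should be treated as assumptions); the variable names are invented for the illustration.

```python
# Lower class limits in dollars and frequencies as read from Table 36.
lower_limits = list(range(2, 22))
workers = [20, 34, 40, 47, 52, 76, 83, 99, 100, 110,
           133, 215, 490, 379, 328, 280, 156, 81, 34, 9]

# Class of maximum frequency and its mid-value (classes are one dollar wide).
modal_index = max(range(len(workers)), key=workers.__getitem__)
modal_class_mid = lower_limits[modal_index] + 0.5

# Middle of the whole range, which runs from $2 to $22.
range_mid = (lower_limits[0] + lower_limits[-1] + 1) / 2

classes_below = modal_index
classes_above = len(workers) - modal_index - 1
print(modal_class_mid, range_mid, classes_below, classes_above)
```

The modal class centers near $14.50 against a mid-range of $12.00, with twelve classes below the peak and only seven above it: the peak lies on the upper side, yet not so far over as to destroy the general bell form.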

The third type of distribution, the extremely asymmetrical,
may be illustrated by the data shown in Table 37 and Chart 29.


FREQUENCY DISTRIBUTION OF REPORTED INCOME AMONG NEW YORK STATE
RESIDENTS IN THE LEGAL PROFESSION, 1920

TABLE 37

INCOME CLASS            NUMBER OF
                        RETURNS
$      0- $1,000            14
   1,000-  2,000           675
   2,000-  3,000         1,091
   3,000-  4,000         1,304
   4,000-  5,000         1,081
   5,000-  6,000           600
   6,000-  7,000           482
   7,000-  8,000           389
   8,000-  9,000           302
   9,000- 10,000           211
  10,000- 11,000           194
  11,000- 12,000           151
  12,000- 13,000           127
  13,000- 14,000           130
  14,000- 15,000           100
  15,000- 20,000           370
  20,000- 25,000           201
  25,000- 30,000           125
  30,000- 40,000           179
  40,000- 50,000            88
  50,000- 60,000            50
  60,000- 70,000            44
  70,000- 80,000            26
  80,000- 90,000            15
  90,000-100,000            11
 100,000 and over           23

CHART 29

[Chart: number of returns (vertical axis) by income in thousands of dollars (horizontal axis, 0 to 40); incomes of $40,000 and over omitted.]

Source: Annual Report of the N. Y. State Tax Commission, 1922, pp. 508-509.
The fourth type of distribution, the “J-shaped,” shows
a maximum frequency in one of the terminal classes of the
distribution. The form may be illustrated by the data given
below.
FREQUENCY DISTRIBUTION OF NUMBER OF CHILDREN PER WIFE IN FALL RIVER,
MASSACHUSETTS

TABLE 38

NUMBER OF CHILDREN    NUMBER OF
TO WIFE               WIVES
None                    3,381
 1 child              [illegible]
 2 children             2,031
 3 children             2,538
 4 children             2,007
 5 children             1,761
 6 children             1,564
 7 children             1,256
 8 children           [illegible]
 9 children               933
10 children           [illegible]
11 children               573
12 children               413
13 children               279
14 children               178
15 children               105
16 children                47
17 children                38
18 children                22
19 children                 8
20 children                 3
21 children                 3

CHART 30

[Chart: number of wives (vertical axis, 0 to 4000) by number of children to a wife (horizontal axis, 0 to 25).]


Frequencies in this distribution decline from a maximum in the 
class of no children to a minimum in the class giving the largest 
reported number of children. 

An important point regarding “J-shaped” distributions is to
be noted. Distributions which appear on first examination to be
“J-shaped” are sometimes resolved into extremely asymmetrical 
distributions upon the adoption of a smaller class interval. Thus, 
if the distribution of income in the United States is shown in 
$2000 intervals, the shape of the distribution appears to be 
“J-shaped.” If, however, the same data are classified in $100 
intervals, the distribution assumes an extremely asymmetrical 
form, the maximum frequency occurring in the class $900 to 
$1000.1 One should be on guard against assuming that because


1 National Bureau of Economic Research, Income in the United States, 1910, pp. 132-133; 
Table 25. 
the maximum frequency appears in a terminal class, the distribution
is essentially “J-shaped” and will show a maximum
frequency in the terminal class, however small the class is made.1
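The effect of the class interval on apparent shape can be shown with a small regrouping exercise. The frequencies below are hypothetical, made up for the illustration and not taken from the income data cited; `regroup` and `modal_position` are invented helper names.

```python
def regroup(frequencies, factor):
    """Merge each run of `factor` adjacent classes into one wider class."""
    return [sum(frequencies[i:i + factor])
            for i in range(0, len(frequencies), factor)]

def modal_position(frequencies):
    """Index of the class of maximum frequency."""
    return max(range(len(frequencies)), key=frequencies.__getitem__)

# Hypothetical frequencies in narrow classes (NOT data from the text):
# the peak lies in the third class, not the first.
narrow = [5, 40, 90, 70, 50, 35, 25, 18, 12, 8, 5, 2]

# The same items thrown into classes four times as wide.
wide = regroup(narrow, 4)
print(wide, modal_position(narrow), modal_position(wide))
```

In the wide grouping the frequencies fall steadily from the first class, so the distribution looks “J-shaped”; the narrow grouping shows the maximum a little way in from the lower end, which is the extremely asymmetrical form.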

It is not to be thought that the distinctions among the four 
types of distribution just indicated are always clear-cut. Obvi- 
ously, there is a zone within which one may be in doubt as to 
whether a given distribution is to be regarded as essentially 
symmetrical or moderately asymmetrical. Similarly, the dif- 
ference between moderately asymmetrical and extremely asym- 
metrical distributions shows imperceptible gradations. The 
illustrations given above have been chosen so as to make the 
types clear. At times the distinctions are not as easily drawn 
as in the cases cited. Even so, the distinctions between the four 
types are important and should be recognized. 

Distributions of the types thus far considered are sometimes 
further classified with reference to the conceivable range of 
variation into three groups: (1) those in which the range is 
essentially unlimited in both directions; (2) distributions in 
which the range is limited in one direction but presumably un- 
limited in the other; (3) distributions in which the range is 
limited in both directions. Among distributions gathered from 
economic and business analyses, the contrast is more likely to be 
between limits on the variable which are abrupt and more or less 
absolute in character, and limits which are remote and somewhat 
vague though presumably not indefinitely removed. Thus, 
age distribution is obviously limited at the lower end, but not so 
precisely limited at the upper. Wages are quite likely to exhibit 
no abrupt limitation at either end. In dealing with cases of 
asymmetry, it is commonly desirable to know whether or not 
there are limits beyond which items either cannot occur at all, 
or are altogether unlikely to occur. 

The simple distributions illustrating the four fundamental types
cited above have all shown one point of concentration: a
definite maximum frequency. Such distributions are said to be


1 If the variable is discrete, and the class interval is a single unit, as in the illustration
of Table 38, this complication can hardly arise.
unimodal, and are to be contrasted with distributions showing
two or more points of concentration. Distributions of this 
latter form are said to be multi-modal. An illustration is given 
in Table 39 and Chart 31. 


TABLE 39. FREQUENCY DISTRIBUTION OF HOURLY RATES OF WAGES AMONG
FEMALE COTTON SPINNERS

RATES PER HOUR        NUMBER OF
(in cents)            SPINNERS
 6 and under  8           252
 8 and under 10           929
10 and under 12         1,380
12 and under 14         1,172
14 and under 16         1,619
16 and under 18           989
18 and under 20           331
20 and under 22            72
22 and under 24            30
24 and under 26             2

CHART 31. FREQUENCY DISTRIBUTION OF HOURLY RATES OF WAGES AMONG
FEMALE COTTON SPINNERS

[Chart: number of spinners (vertical axis) by wages per hour in cents (horizontal axis).]
Multi-modal distributions assume a wide variety of forms and
defy systematic classification. They call rather for individual
treatment. Commonly, they suggest further analysis of the
material with a view to the detection of an admixture of data.
The presence of two or more modes is presumptive evidence of a
mixture of two distinct variables, each with its own characteristic
point of concentration. In general, multi-modal distributions
may be said to be under suspicion; certainly they suggest analysis
of the original data with reference to the possible subdivision of
the individual cases into more homogeneous groups.

The illustration just given is a case in point. The distribution
of female cotton spinners by wage-rates manifestly is multi-modal,
there being one maximum at 11 cents and another at
15 cents. Detailed examination of the original returns with a
subdivision of the items according to groups of mills separates 
the distribution, however, into the two distributions given in 
Table 40 and Chart 32. 


TABLE 40. FREQUENCY DISTRIBUTION OF HOURLY RATES OF WAGES AMONG
FEMALE COTTON SPINNERS IN GROUPS A AND B

RATES PER HOUR        NUMBER OF SPINNERS IN MILLS OF
(in cents)            GROUP A      GROUP B
 6 and under  8          241           11
 8 and under 10          809          120
10 and under 12         1034          346
12 and under 14          464          708
14 and under 16          159         1460
16 and under 18           26          963
18 and under 20           14          317
20 and under 22            3           69
22 and under 24            0           30
24 and under 26            0            2

Here we have two unimodal distributions of homogeneous items. 
The distributions fall readily into the simple classification of 
types given earlier in the chapter. 
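The resolution of the two-peaked distribution into two single-peaked ones can be verified from the group figures. The sketch below takes the Group A and Group B frequencies from Table 40; the helper `local_peaks` is an invented name, and a "peak" is assumed here to mean a class whose frequency exceeds that of both neighbours.

```python
# Frequencies by class (6 and under 8, 8 and under 10, ...), from Table 40.
group_a = [241, 809, 1034, 464, 159, 26, 14, 3, 0, 0]
group_b = [11, 120, 346, 708, 1460, 963, 317, 69, 30, 2]
lower_limits = list(range(6, 26, 2))  # cents per hour: 6, 8, ..., 24

def local_peaks(frequencies):
    """Indices of classes whose frequency exceeds both neighbours."""
    padded = [0] + list(frequencies) + [0]
    return [i - 1 for i in range(1, len(padded) - 1)
            if padded[i] > padded[i - 1] and padded[i] > padded[i + 1]]

# Adding the groups class by class gives back the combined distribution.
combined = [a + b for a, b in zip(group_a, group_b)]

peaks_combined = [lower_limits[i] for i in local_peaks(combined)]
peaks_a = [lower_limits[i] for i in local_peaks(group_a)]
peaks_b = [lower_limits[i] for i in local_peaks(group_b)]
print(combined, peaks_combined, peaks_a, peaks_b)
```

The combined series peaks in the classes beginning at 10 and 14 cents, while each group taken alone has a single peak; the second mode comes from mixing two groups of mills, not from the wage structure of either.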

CHART 32. FREQUENCY DISTRIBUTION OF HOURLY RATES OF WAGES AMONG
FEMALE COTTON SPINNERS IN GROUPS A AND B

[Chart: number of spinners in Groups A and B (vertical axis) by wages per hour in cents (horizontal axis).]

The differences emphasized in the various classifications of
frequency distributions indicate that there are at least three
features in which frequency distributions may differ radically.
They may show entirely different typical sizes in the variable.
In other words, essentially different “central tendencies” may
result in essentially different points of concentration or of
maximum frequency. In the second place, the extent of concentration
of items may vary widely. Some distributions are much
more compact than others; some show a high degree of concentration
about the point of maximum frequency, others a wide
dispersion. In the third place, distributions may differ widely
in their symmetry. Some show essentially the same conditions
on either side of the point of maximum frequency; others show
very different conditions on the two sides.

Corresponding to these three points in which distributions 
may differ fundamentally, there are three important lines of 
statistical analysis. Frequency distributions may be analyzed 
with reference to the typical size of item which they display;
with reference to the variability or dispersion of the items; and
with reference to the asymmetry, or skewness, of the distribution. 
These three lines of investigation — all fundamental in the 
examination of attributive variation — constitute the subject 
matter of the following two chapters. 


CHAPTER X 
TYPICAL SIZE: AVERAGES 


It has been shown in the preceding chapters that the record
of a variable may often be effectively summarized in a frequency 
distribution. Summarization may, however, be carried much 
further. In an average, the record of the variable is compressed 
into a single figure which in peculiar degree is a representation or 
summarization of the full record of the variable. 

Averages constitute one of the most useful instruments of 
statistical analysis and merit most careful study. They have 
long been a subject of statistical interest, and the technique of 
their use has already been fully elaborated. In general, the 
consideration of averages involves two distinct lines of investiga- 
tion: (1) the definition of different serviceable averages and of the 
best methods of obtaining each; and (2) the evaluation of the 
different averages and the determination of the basis of selecting 
an average in the analysis of any particular body of data. These 
subjects will be presented in the two sections which follow. 


A. THE DEFINITION AND COMPUTATION OF AVERAGES


There is no limit to the conceivable variety of averages. By 
nature, an average is merely an intermediate and representative 
value of the variable. With any particular set of items, the 
ingenuity and patience of the analyst may bring forth an in- 
definitely large number of such intermediate values. Only five 
forms of average, however, merit attention in elementary sta- 
tistical analysis. These five are the (1) arithmetic mean; 
(2) geometric mean; (3) harmonic mean; (4) median; and 
(5) mode. Let us consider in turn how each of these is to be 
defined, and how each is to be obtained from a given set of
data.1


Arithmetic Mean 


The most familiar of the averages is the arithmetic mean. It 
is so universally employed that the term “average” is frequently
applied to it without qualifying adjective. The arithmetic mean
is the “average” of common speech.

The nature of the arithmetic mean is known to practically all. 
It is the sum of the items divided by their number. Thus, the 
arithmetic mean of 2, 3, 4, 5, and 6 is

$$\frac{2 + 3 + 4 + 5 + 6}{5} = \frac{20}{5} = 4.$$

In terms of the generalized notation just suggested (see footnote
below),

$$M_a = \frac{X_1 + X_2 + X_3 + \cdots + X_n}{N} = \frac{\Sigma X}{N},$$

where $\Sigma X$ is the sum of all the items of form X, and N the number
of such items.


1 In the explanations of this chapter it is desirable to work with a generalized form for
the record of the variable. Suppose that reports of earnings of employees in a given
factory for a specified week, when arranged in the order of ascending magnitude, run as
shown in column (A).

   (A)        (B)
 $14.65      $X_1$
  14.75      $X_2$
  14.80      $X_3$
  14.05      $X_4$
   ...        ...
  37.59      $X_n$

Such a run of items may be represented in general by the letters given in column (B), where
n is the total number of items. This form of notation will be employed wherever it is
desirable to represent the record of the variable in simple generalized form.

Similarly it will help in the exposition if the several averages are represented at times by
symbols. A simple set of these follows:

Arithmetic mean . . . . . . $M_a$
Geometric mean  . . . . . . $M_g$
Harmonic mean . . . . . . . $M_h$
Median  . . . . . . . . . . $M_d$
Mode  . . . . . . . . . . . $M_o$
The arithmetic mean is easily calculated. When the observations
or reports of the variable are still unclassified (in other
words, when no frequency distribution has been formed) and
when, furthermore, no classification is essential to the analysis, 
the arithmetic mean is most satisfactorily secured through direct 
summation of the items. Division of this sum by the number of 
items gives the desired average. 

More commonly, the problem of obtaining the arithmetic 
mean arises after the data have been cast into a frequency dis- 
tribution. The sum of the items is then obtained by multiplying 
the mid-value of each class by the corresponding frequency. 
Dividing by the total of the frequencies (or, in other words, by the total
number of items) gives the desired mean. If the mid-values
of the successive classes are represented by $X_1, X_2, X_3, \ldots, X_n$,
and the frequencies of the corresponding classes by $f_1, f_2, f_3, \ldots, f_n$,
the arithmetic mean from a frequency table is given in generalized
form by the expression:1

$$M_a = \frac{f_1 X_1 + f_2 X_2 + \cdots + f_n X_n}{f_1 + f_2 + \cdots + f_n} = \frac{\Sigma (fX)}{\Sigma f}$$


This method, employed in calculating the arithmetic mean of 
a variable presented in a frequency table, rests upon a funda- 
mental assumption: namely, that the items in any class of the 
distribution are evenly distributed over the interval so that the 
arithmetic mean of the values of the items in the class is equal 
to the mid-value of the class.2 The importance of this assumption
in the calculation of averages from the frequency distribution
is to be carefully recognized. Obviously, the assumption is
not likely to be in perfect conformity to the data. But if the 
class intervals are kept relatively small, the assumption involves 
no substantial error for the distribution as a whole unless the 
distribution be extremely asymmetric. In the latter case errors 
in the assumption do not compensate for one another in the 


1 This formula is just as applicable to percentage distributions as to distributions in
absolute numbers. 

2 A somewhat different assumption may be made; namely, that all the items of each class 
are massed at the mid-value. 
computation of such an average as the arithmetic mean. As a
result, the mean computed from the frequency distribution may 
differ substantially from that obtained by direct summation of 
the individual records of the variable. The point is one that 
should be kept constantly in mind in the construction of the 
frequency distribution.1
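How much the mid-value assumption matters in a strongly skewed series, and how the class interval governs the error, can be tried on made-up figures. Everything in this sketch is hypothetical: the items are evenly spaced quantiles of an exponential curve chosen only because it is extremely asymmetric, and `grouped_mean` is an invented helper.

```python
from math import floor, log

def grouped_mean(values, width):
    """Mean computed as if every item lay at the mid-value of its class."""
    mids = [(floor(v / width) + 0.5) * width for v in values]
    return sum(mids) / len(mids)

# Hypothetical, strongly skewed items (NOT from the text): evenly spaced
# quantiles of an exponential curve whose mean is 1.
n = 1000
values = [-log(1 - (i + 0.5) / n) for i in range(n)]

# Mean by direct summation of the individual items.
direct = sum(values) / n

# Error of the frequency-table mean under a narrow and a wide interval.
narrow_error = abs(grouped_mean(values, 0.1) - direct)
wide_error = abs(grouped_mean(values, 2.0) - direct)
print(direct, narrow_error, wide_error)
```

With the narrow interval the two means agree closely; with the wide interval the items in each class crowd toward its lower limit, the mid-values overstate them, and the errors accumulate in one direction instead of offsetting.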

The method, just outlined, of obtaining an arithmetic mean 
from a frequency table is shown in full below: 


TABLE 41. CALCULATION OF ARITHMETIC MEAN FROM SIMPLE FREQUENCY
DISTRIBUTION: LONG METHOD

(Wage-rates per Hour of Cotton Spinners)

WAGE-RATE PER HOUR   NUMBER OF SPINNERS   MID-VALUE OF CLASS   MID-VALUE TIMES
(in cents)           RECEIVING RATES      INTERVAL             FREQUENCY
  6.0- 7.8                   2                  6.9                  13.8
  8.0- 9.8                  17                  8.9                 151.3
 10.0-11.8                  36                 10.9                 392.4
 12.0-13.8                  68                 12.9                 877.2
 14.0-15.8                 100                 14.9                1490.0
 16.0-17.8                 119                 16.9                2011.1
 18.0-19.8                 124                 18.9                2343.6
 20.0-21.8                 125                 20.9                2612.5
 22.0-23.8                 120                 22.9                2748.0
 24.0-25.8                 104                 24.9                2589.6
 26.0-27.8                  64                 26.9                1721.6
 28.0-29.8                  35                 28.9                1011.5
 30.0-31.8                  26                 30.9                 803.4
 32.0-33.8                  20                 32.9                 658.0
 34.0-35.8                  15                 34.9                 523.5
 36.0-37.8                  11                 36.9                 405.9
 38.0-39.8                   7                 38.9                 272.3
 40.0-41.8                   5                 40.9                 204.5
 42.0-43.8                   3                 42.9                 128.7
 44.0-45.8                   2                 44.9                  89.8
 46.0-47.8                   1                 46.9                  46.9
 48.0-49.8                   1                 48.9                  48.9

                          1005                                    21144.5

$$M_a = \frac{21144.5}{1005} = 21.04\text{¢}$$
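The long method amounts to a weighted sum followed by one division. A minimal sketch, using the frequencies and mid-values of the cotton spinners' table (the variable names are invented for the illustration):

```python
# Frequencies of the 22 classes, lowest class first.
frequencies = [2, 17, 36, 68, 100, 119, 124, 125, 120, 104, 64,
               35, 26, 20, 15, 11, 7, 5, 3, 2, 1, 1]

# Mid-values run 6.9, 8.9, ..., 48.9 cents (class interval of 2 cents).
mid_values = [6.9 + 2 * i for i in range(len(frequencies))]

total_items = sum(frequencies)
weighted_sum = sum(f * x for f, x in zip(frequencies, mid_values))
mean = weighted_sum / total_items
print(total_items, round(weighted_sum, 1), round(mean, 2))
```

The labor the text complains of lies in the 22 multiplications by awkward mid-values; a machine does not mind, but the hand computer has good reason to prefer smaller numbers.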


1 In connection with that subject and the present one of averages, it is an instructive
exercise for the student to compute the arithmetic mean from both the unclassified data and 
the corresponding data in frequency form, a careful examination being made of any dif- 
ference in the averages obtained in the two ways. 
The computation here is straightforward but somewhat laborious.

A method which is much simpler may be applied in most 
instances. This simpler method is known as the short-cut or 
step-deviation method. The several steps of the method are as 
follows : 


1. Take as an arbitrary origin, A, the mid-value of one
of the classes of greatest frequency; then the mid-value of
each class will be a certain number of class-intervals or
steps greater (+) or smaller (-) than the mid-value at the
arbitrary origin.1

2. Multiply the frequency of each class by its deviation in
intervals from the arbitrary origin.

3. Find the algebraic sum of these step-deviations.

4. Divide this sum by the total number of items.

5. Multiply this quotient (which is the correction in class
intervals) by the size of the interval.

6. The result indicates the correction, c, plus or minus as
the case may be, which is required to obtain $M_a$ from A,
where $M_a = A + c$.

7. The result finally may be verified by working through the
calculation with a different arbitrary origin.2
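The steps above can be sketched in code, including the final verification with a second origin. The function `short_cut_mean` and its parameters are invented for the illustration; the frequencies are those of the cotton spinners' wage table, with mid-values starting at 6.9 cents and an interval of 2 cents.

```python
frequencies = [2, 17, 36, 68, 100, 119, 124, 125, 120, 104, 64,
               35, 26, 20, 15, 11, 7, 5, 3, 2, 1, 1]

def short_cut_mean(frequencies, first_mid, interval, origin_index):
    """Step-deviation mean; returns (mean, algebraic sum of step-deviations)."""
    # Step 1: arbitrary origin A is the mid-value of the chosen class.
    origin = first_mid + origin_index * interval
    # Steps 2 and 3: frequency times deviation in intervals, summed algebraically.
    step_sum = sum(f * (i - origin_index) for i, f in enumerate(frequencies))
    # Steps 4 and 5: divide by the number of items, then restore original units.
    correction = step_sum / sum(frequencies) * interval
    # Step 6: the mean is the origin plus the correction c.
    return origin + correction, step_sum

# Origin at the class 20.0-21.8 (mid-value 20.9).
mean_a, sum_a = short_cut_mean(frequencies, 6.9, 2.0, 7)

# Step 7: repeat with a different origin, the class 22.0-23.8 (mid-value 22.9).
mean_b, sum_b = short_cut_mean(frequencies, 6.9, 2.0, 8)
print(round(mean_a, 2), sum_a, round(mean_b, 2), sum_b)
```

Moving the origin up one class lowers every step-deviation by one, so the algebraic sum drops by the total number of items (from +70 to -935) while the resulting mean is unchanged.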


A concrete illustration of the calculation of the arithmetic mean 
by the short-cut method is given in Table 42. 

The advantage of this method lies in the fact that it deals in 
smaller even numbers. These are introduced partly through the 
substitution of deviations for the original values of the variable, 
partly through the use of intervals instead of the original measures 
in the calculation of the algebraic sum of the deviations. As a 

1 At this point, the difficulties arising from unequal intervals or indeterminate extreme 
classes are serious. The latter can be handled only after an estimate has been made of the 
average size of the items in each of the classes. 

2 The correction factor, c, used in this method, occurs in the computations of other
statistical measures. Its value is always given by the expression $M_a - A$; i.e. the
difference between the arithmetic mean and the arbitrary origin. If the deviations of the
original items $X_1, X_2, X_3, \ldots, X_n$ from the arbitrary origin A are represented by
$x_1, x_2, x_3, \ldots, x_n$, the value of c is equal to $\frac{\Sigma x}{N}$. This expression for c should be
carefully noted. It may be stated in units of the interval or of original measurement.
result of these substitutions, the figures of the computation are
much easier to handle, and the arithmetic task of obtaining the
mean is reduced to the simplest possible terms.¹


TABLE 42. CALCULATION OF ARITHMETIC MEAN FROM SIMPLE FREQUENCY
DISTRIBUTION: SHORT-CUT METHOD


(Wage-rates per Hour of Cotton Spinners)


RATES PER HOUR   NUMBER RECEIVING   DEVIATION FROM     DEVIATION TIMES   SUM OF DEVIATIONS
(in cents)       RATES              ARBITRARY ORIGIN   FREQUENCY         TIMES FREQUENCY
                                    (in intervals)

 6.0- 7.8             2                  -7                 -14
 8.0- 9.8            17                  -6                -102
10.0-11.8            36                  -5                -180
12.0-13.8            68                  -4                -272
14.0-15.8           100                  -3                -300
16.0-17.8           119                  -2                -238
18.0-19.8           124                  -1                -124              -1230

20.0-21.8           125                   0                   0
22.0-23.8           120                  +1                 120
24.0-25.8           104                  +2                 208
26.0-27.8            64                  +3                 192
28.0-29.8            35                  +4                 140
30.0-31.8            26                  +5                 130
32.0-33.8            20                  +6                 120
34.0-35.8            15                  +7                 105
36.0-37.8            11                  +8                  88
38.0-39.8             7                  +9                  63
40.0-41.8             5                 +10                  50
42.0-43.8             3                 +11                  33
44.0-45.8             2                 +12                  24
46.0-47.8             1                 +13                  13
48.0-49.8             1                 +14                  14              +1300

                   1005                                                        +70

$M_a = A + c$, where A = 20.9


$c = \frac{+70}{1005}$ interval $= +.07$ interval $= +.14$¢


$M_a = 20.9 + .14 = 21.04$


1 The validity of the short-cut method rests upon the fact that the algebraic sum of the
deviations of any set of items from their arithmetic mean is 0. It follows that the dif-
ference between an arbitrarily chosen origin and the actual mean is equal to the algebraic
sum of the deviations divided by the total number of items. This difference, c, which is
always equal to $M_a - A$, is, as already noted, important in the calculation of many
statistical coefficients.
Geometric Mean


The geometric mean is not as common as the arithmetic, but 
has special applications in statistical analysis and is increasingly 
used. It is important, therefore, to understand its nature, and 
the means by which it may be obtained. 

The geometric mean is defined as the Nth root of the product
of the N items. Thus, the geometric mean of 2, 3, 4, 5, and 6 is
$\sqrt[5]{2 \times 3 \times 4 \times 5 \times 6} = \sqrt[5]{720} = 3.73$. In the notation already
introduced, $M_g = \sqrt[N]{X_1 X_2 X_3 \cdots X_N}$.


Computation of the geometric mean by elementary arithmetic 
processes is laborious. The amount of work is greatly reduced, 
however, through the use of logarithms. In terms of logarithms, 


$\log M_g = \frac{\log X_1 + \log X_2 + \cdots + \log X_N}{N} = \frac{\Sigma \log X}{N}$


In other words, the geometric mean is the number corresponding
to (i.e. the antilog of) the arithmetic mean of the logarithms
of the original items.
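
As a brief illustration of the logarithmic route, the following Python sketch takes the antilog of the mean of the logarithms. The function name is hypothetical, and common (base-10) logarithms are assumed, as in a printed table of logarithms.

```python
import math


def geometric_mean(items):
    """Antilog of the arithmetic mean of the common logarithms of the items."""
    logs = [math.log10(x) for x in items]
    return 10 ** (sum(logs) / len(logs))


gm = geometric_mean([2, 3, 4, 5, 6])
print(round(gm, 2))  # 3.73, the fifth root of 720
```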

If the data are in the form of a frequency distribution, the 
mid-values of the successive classes are dealt with just as the 
mid-values were in the calculation of the arithmetic mean. In 
obtaining the geometric mean the frequencies appear as ex- 
ponents of the mid-values, the formula being 


$M_g = \sqrt[\Sigma f]{X_1^{f_1} X_2^{f_2} \cdots X_n^{f_n}}$


where $X_1, X_2, \ldots, X_n$ are the mid-values of the successive classes.
Logarithms are ordinarily employed to facilitate computation of
this result, in which case the formula becomes:


$\log M_g = \frac{f_1 \log X_1 + f_2 \log X_2 + \cdots + f_n \log X_n}{\Sigma f} = \frac{\Sigma f(\log X)}{\Sigma f}$


The form of the computation of the geometric mean, using 
logarithms, is indicated in the following table : 


1 The geometric mean of a given set of positive numbers is never greater than the arithmetic
mean, and is less than it unless all the items are equal. It is obvious that the geometric
mean is zero if any item is zero.

TABLE 43. CALCULATION OF GEOMETRIC MEAN FROM
SIMPLE FREQUENCY DISTRIBUTION


(Increases in Value of Products of Individual Industries during
Single Intercensal Period)


PERCENTAGE INCREASE     NUMBER OF      LOGARITHM OF    LOGARITHM TIMES
(nearest 10 per cent)   INDUSTRIES     MID-VALUE       FREQUENCY

         10                 37           1.0000            37.0000
         20                105           1.3010           136.6050
         30                143           1.4770           211.2110
         40                102           1.6021           163.4142
         50                 96           1.6991           163.1136
         60                 81           1.7782           144.0342
         70                 69           1.8451           127.3119
         80                 54           1.9030           102.7620
         90                 42           1.9541            82.0722

        100                 36           2.0000            72.0000
        110                 31           2.0414            63.2834
        120                 23           2.0792            47.8216
        130                 18           2.1139            38.0502
        140                  9           2.1461            19.3149
        150                  4           2.1761             8.7044
        160                  2           2.2041             4.4082
        170                  2           2.2304             4.4608
        180                  1           2.2553             2.2553

                           855                           1427.8229

$\log M_g = \frac{1427.8229}{855} = 1.6699$          $M_g = 46.8\%$


Even with the use of logarithms, this computation is somewhat 
laborious. It may be expedited, however, through the use of 
specially constructed slide-rules. 
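
Where a computing machine is at hand, the same calculation can be sketched as follows. The function name is hypothetical; the mid-values and frequencies are those of Table 43 as far as they can be read (the frequencies total 855). Because the logarithms are here carried to more places than the four-place table, the last digit of the logarithm may differ slightly from the printed figure.

```python
import math


def grouped_geometric_mean(mid_values, frequencies):
    """Geometric mean of a frequency distribution: antilog of sum(f log X) / sum(f)."""
    weighted_logs = sum(f * math.log10(x) for x, f in zip(mid_values, frequencies))
    return 10 ** (weighted_logs / sum(frequencies))


# Percentage increases 10, 20, ..., 180 and the number of industries at each.
mid_values = list(range(10, 190, 10))
frequencies = [37, 105, 143, 102, 96, 81, 69, 54, 42,
               36, 31, 23, 18, 9, 4, 2, 2, 1]

gm = grouped_geometric_mean(mid_values, frequencies)
print(round(gm, 1))  # about 46.8 per cent
```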


Harmonic Mean 


The harmonic mean is even less common than the geometric. 
It nevertheless has certain special uses which are later to be noted, 
and a brief explanation of its nature and the method of its cal- 
culation is desirable. 

The harmonic mean is defined as the reciprocal of the arith- 
metic mean of the reciprocals of the items. Thus, the harmonic 
mean of 2, 3, 4, 5, and 6 is

$\frac{1}{\left(\frac{1}{2} + \frac{1}{3} + \frac{1}{4} + \frac{1}{5} + \frac{1}{6}\right) / 5} = 3.45$

In the customary notation:

$\frac{1}{M_h} = \frac{\frac{1}{X_1} + \frac{1}{X_2} + \cdots + \frac{1}{X_N}}{N} = \frac{\Sigma \frac{1}{X}}{N}$


In calculating the harmonic mean, use should ordinarily be 
made of a table of reciprocals. If the mean is obtained from the 
individual records of the variable, it appears as the reciprocal 
of the arithmetic mean of the reciprocals of the items. If, upon 
the other hand, the mean is obtained from a frequency distribu- 
tion, the reciprocals of the mid-values of the class are multiplied 
by the associated frequencies and the mean of the reciprocals 
obtained from the sum of these products. The harmonic mean 
is the reciprocal of this result. The underlying formula may be 
given as follows : 


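$\frac{1}{M_h} = \frac{f_1\left(\frac{1}{X_1}\right) + f_2\left(\frac{1}{X_2}\right) + \cdots + f_n\left(\frac{1}{X_n}\right)}{\Sigma f} = \frac{\Sigma f\left(\frac{1}{X}\right)}{\Sigma f}$

or

$M_h = \frac{\Sigma f}{\Sigma f\left(\frac{1}{X}\right)}$

where f is the frequency of the class, and X its mid-value.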


The method of obtaining the harmonic mean from a frequency 
table is illustrated in Table 44. 
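
The same computation may be sketched in Python. The function name is hypothetical; the figures are those of Table 44 as far as they can be read (the frequencies total 410). Exact reciprocals are used here in place of a four-place table of reciprocals, so the sum of the products differs very slightly from the printed total.

```python
def harmonic_mean(values, frequencies=None):
    """Reciprocal of the (weighted) arithmetic mean of the reciprocals."""
    if frequencies is None:
        frequencies = [1] * len(values)   # individual records: each item counts once
    reciprocal_sum = sum(f / x for x, f in zip(values, frequencies))
    return sum(frequencies) / reciprocal_sum


simple = harmonic_mean([2, 3, 4, 5, 6])

# Table 44: turnarounds of 34 to 48 and the number of trips at each.
turnarounds = list(range(34, 49))
trips = [5, 11, 24, 21, 33, 42, 49, 47, 46, 38, 35, 26, 17, 12, 4]
grouped = harmonic_mean(turnarounds, trips)

print(round(simple, 2), round(grouped, 1))
```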


Median 


The averages thus far considered have all been calculated 
means; that is, they have been means obtained by mathematical 


1 The harmonic mean of any set of positive numbers that are not all equal is less than the
geometric, which in turn, as we have seen, is less than the arithmetic. The three means form
approximately a geometric progression (exactly so when only two items are averaged). In other
words $M_g \approx \sqrt{M_a \cdot M_h}$. In the illustration given above, in which
the numbers averaged were 2, 3, 4, 5, and 6, the $M_a$ was 4, the $M_g$ 3.73, and the $M_h$ 3.45,
where $\sqrt{4.00 \times 3.45} = 3.71$, which is close to 3.73.
processes. A contrasting type of average is the average of posi-
tion. Averages of this sort are defined in terms of definite 
arrangements or orderings of the individual items. The two 
averages of position that are to be considered in elementary 
analysis are the median and the mode. Of these two, the median 
is the simpler. It will, therefore, be considered first. 


TABLE 44. CALCULATION OF HARMONIC MEAN FROM
SIMPLE FREQUENCY DISTRIBUTION


(Turnarounds on Trips of Army Transports)


TURNAROUND    NUMBER OF TRIPS    RECIPROCAL OF    RECIPROCAL TIMES
                                 TURNAROUND       FREQUENCY

    34               5              .0294             .1470
    35              11              .0286             .3146
    36              24              .0278             .6672
    37              21              .0270             .5670
    38              33              .0263             .8679
    39              42              .0256            1.0752
    40              49              .0250            1.2250
    41              47              .0244            1.1468
    42              46              .0238            1.0948
    43              38              .0233             .8854
    44              35              .0227             .7945
    45              26              .0222             .5772
    46              17              .0217             .3689
    47              12              .0213             .2556
    48               4              .0208             .0832

                   410                              10.0703

$\frac{1}{M_h} = \frac{10.0703}{410}$          $M_h = \frac{410}{10.0703} = 40.7$


The median is commonly said to be the size of the middle item 
when the items are arranged in the order of magnitude; or, if 
the number of items is even and there is no single middle item, 
the median is held to be the arithmetic mean of the two middle 
items. If the full details of the original observations are avail- 
able, this concept of the median serves very well. A somewhat 
different definition is at times, however, more simple of applica- 
tion. This is the definition of the median as that point on the 
scale of the variable which divides the items, when arranged in 
the order of magnitude, into two groups, each containing one-half
of the total number.

Determination of the median from the full details or reports 
of a variable is extremely simple. Under these conditions the 
first definition of the median is directly applicable. The items 
need only to be arranged in ascending or descending order of 
magnitude and the size of the middle item (or the arithmetic 
mean of the values of the two middle items if the number of items 
be even) noted in this array. This value is the median. 
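
The first definition translates directly into a few lines of Python; the function name and the sample figures below are hypothetical.

```python
def median_of_items(items):
    """Size of the middle item, or the mean of the two middle items if N is even."""
    ordered = sorted(items)          # arrange in ascending order of magnitude
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_case = median_of_items([7, 3, 5])       # middle item of 3, 5, 7
even_case = median_of_items([4, 1, 3, 2])   # mean of the two middle items, 2 and 3
print(odd_case, even_case)
```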

If, on the other hand, the median has to be obtained from a 
frequency distribution, its determination is not as simple. The 
problem in this case is to interpolate for the median in one of
the classes of the distribution. Two methods are employed: 
(1) the arithmetic, and (2) the graphic. Both require careful 
explanation. 

Arithmetic interpolation of the median proceeds most simply 
if based on the second concept of the median; namely, that the 
median is that position on the scale of the variable which divides 
the distribution into two equal parts.¹ The assumption is made
of an even spacing of the items within the frequency class in which
the median lies. The first and last of the items in this class are
assumed to be one-half of a sub-interval from the class limits,
the sub-interval being the distance from one of the evenly spaced
items to the next.² The median is found by cumulating items
from the lower end of the distribution up to the class in which the 


1 The number in each part is given by the expression $\frac{N}{2}$.


2 This removal of the first and last items from the class limits is to avoid the objection 
that an item placed on the class limit may virtually coincide with an item practically on the 
class limit in the adjacent class. (See Zizek, Statistical Averages, pp. 208-209.) 

The situation may be shown schematically as follows:

[Diagram: the class interval in which the median lies, extending from the lower class limit to the upper class limit.]

Here 37 items (the number in the class) are evenly spaced over the class interval, the distance
between any two successive items being a sub-interval.
median is located,¹ then adding to the lower limit of this class the
number of sub-intervals represented by the number of items that
have to be added to include $\frac{N}{2}$. If L is the lower limit of the
class in which the median lies, f the frequency of items in this
class, i the class interval, and g the number of items which must
be added to reach the median, the formula for locating the median
is as follows:²

$M_e = L + g\left(\frac{i}{f}\right)$

The way in which arithmetic interpolation proceeds may be 
made clearer by a concrete illustration. Suppose the problem 
is to determine the median of the variable given in the distribution 
in Table 45. 


TABLE 45. INTERPOLATION OF MEDIAN IN SIMPLE FREQUENCY DISTRIBUTION
(Percentage of Members Unemployed in 120 Trade Unions)


PERCENTAGE UNEMPLOYED    NUMBER OF UNIONS REPORTING

      0.0-9.9                     120

      0.0-0.9                       5
      1.0-1.9                      36
      2.0-2.9                      37 ³
      3.0-3.9                      17
      4.0-4.9                      10
      5.0-5.9                       7
      6.0-6.9                       5
      7.0-7.9                       1
      8.0-8.9                       1
      9.0-9.9                       1


1 Of course, interpolation may be worked from the upper as well as the lower end of the 
distribution. It is left to the student to develop the procedure and make application in an 
illustrative case. 

2 If the median is conceived to be the size of a particular item, the item is to be found in the
array from the expression $\frac{N+1}{2}$. Thus, if there are 113 items in the array, the median is the
fifty-seventh item ($\frac{113+1}{2}$ being equal to 57). The formula to apply in interpolating for
the median on this basis is:

$M_e = L + \left(g - \frac{1}{2}\right)\left(\frac{i}{f}\right)$

This may be transformed to read:

$M_e = L + \left(\frac{2g - 1}{2f}\right) i$


3 Clearly the median lies in this class. 
In this distribution $\frac{N}{2} = 60$. The median, then, is that position
on the scale of the variable which divides the distribution into
two equal parts of 60 items each. If the items are cumulated
from the lower end of the scale, it will be found that 41 are
included in the first two classes, that is, in the classes 0.0-0.9 and
1.0-1.9. In the next class there are 37 items. To reach the
position on the scale which lies halfway between the 60th and
61st items (which, in other words, divides the distribution into
two equal parts of 60 items each) it is necessary to cumulate 19 of
the 37 items included in the class from 2.0 to 2.9. The lower limit
of this class is 1.95. The median is given then by the expression:


$M_e = 1.95 + 19\left(\frac{1}{37}\right) = 1.95 + .51$


The median is 2.46, or 2.5.
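
The interpolation just worked by hand may be sketched in Python as follows. The function name is hypothetical. The text states only that the first two classes together contain 41 items; the split into 5 and 36 used below is an assumption and does not affect the result, since only their total enters the calculation.

```python
def interpolated_median(lower_limits, frequencies, interval):
    """Median of a frequency distribution by arithmetic interpolation.

    Cumulates from the lower end and applies L + g * (i / f) in the class
    that contains the N/2 point, assuming even spacing within that class.
    """
    half = sum(frequencies) / 2
    cumulated = 0
    for lower, f in zip(lower_limits, frequencies):
        if f > 0 and cumulated + f >= half:
            g = half - cumulated          # items still to be added in this class
            return lower + g * interval / f
        cumulated += f
    raise ValueError("distribution contains no items")


# Table 45: true lower limits -0.05, 0.95, 1.95, ... and the class frequencies.
lower_limits = [-0.05 + k for k in range(10)]
unions = [5, 36, 37, 17, 10, 7, 5, 1, 1, 1]   # first two entries assumed; they total 41

median = interpolated_median(lower_limits, unions, interval=1.0)
print(round(median, 2))
```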

It might appear that if the variable is actually discrete, the 
assumption of even spacing — made in the method of arithmetic 
interpolation just explained —is not warranted. A discrete 
variable, as we have seen, falls in fact only upon stated intervals 
of the scale. The logical assumption as to the location of items, 
therefore, would seem to be location in equal bunches on the 
possible positions of the scale. Any attempt to apply this notion, 
however, leads to considerable complexity in the interpolation. 
It is really unnecessary to deal with these. For the purposes of 
interpolation, continuity in the variable, both as to its class 
limits and the spacing of the items within these limits, may be 
safely assumed. If this assumption is made, the formula given 
above applied, and the result then finally rounded off to the 
nearest possible value for the variable, the median secured is 
identical with that which would be obtained if it were assumed 
that the items were bunched equally at the only possible values 
of the variable in question.¹

Interpolation for the median by the graphic method is perhaps
more satisfactory than by the arithmetic. The method in this


1 If the variable is discrete it is ordinarily best to round off the value of the median to
the nearest possible size of the variable.
case works with the cumulative frequency graph. With such a
cumulative graph in hand the median is readily obtained by
drawing a horizontal line from the vertical scale at $\frac{N}{2}$ and drop-
ping from the intersection of this horizontal line with the cumula-
tive graph a perpendicular to the horizontal scale. The foot of
this perpendicular locates the median. The method is indicated
in Chart 33.


CHART 33. LOCATION OF MEDIAN FROM CURVE OF CUMULATIVE DISTRIBUTION
(Percentage of Members Unemployed in 120 Trade Unions)

[Chart: curve of cumulative distribution. Horizontal scale: Percentage of Members Unemployed, 0 to 10.0, with the median (Me) marked between 2.0 and 3.0.]


Mode 


The mode is another average of position. It may be defined 
as the most frequent size of item, or as that position on the scale 
of the variable at which items are most common, or “thickest.”¹
In terms of the frequency diagram, the mode is that point on the 
horizontal scale over which the figure rises to its greatest height. 


1 Two varieties of the mode are recognized: the crude and the refined. The refined
presupposes the determination of the theoretical frequency curve most closely fitting the 
distribution exhibited by the actual data. The refined mode is the value of the variable cor- 
responding to the maximum ordinate in this fitted frequency curve. Adequate treatment 
of the refined mode lies beyond the scope of elementary study. The present discussion is 
confined, therefore, to the so-called crude mode. 
Determination of the mode can hardly be effected directly from
unclassified data; it is essentially dependent upon the previous
construction of a frequency distribution. Given such a frequency
distribution, the mode may be crudely located at the mid-value
of the class of maximum frequency.¹ This location of the mode
may be improved, however, by taking into consideration any
marked inequality of frequency in the adjacent classes. Thus,
if L is the lower limit of the class of maximum frequency, i the
class interval, $f_1$ the frequency in the next lower class, and $f_2$
the frequency in the next higher class, the mode may be de-
termined by the formula:²


$M_o = L + \frac{f_2}{f_1 + f_2} i$
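
A short Python sketch of this adjustment follows. The function and argument names are hypothetical, and the formula is taken to be the lower limit plus the fraction f2 / (f1 + f2) of the class interval, which agrees with the worked figures in the footnote (lower limit 1.95, adjacent frequencies 36 and 17).

```python
def interpolated_mode(lower_limit, interval, f_lower, f_higher):
    """Crude mode shifted within the modal class toward the fuller neighbour.

    f_lower and f_higher are the frequencies of the next lower and next
    higher classes respectively.
    """
    return lower_limit + f_higher / (f_lower + f_higher) * interval


mode = interpolated_mode(lower_limit=1.95, interval=1.0, f_lower=36, f_higher=17)
print(round(mode, 2))  # 2.27
```

When the two adjacent frequencies are equal the result is the mid-value of the modal class, which is the crude location first described.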


A serious objection to this method of determining the mode is 
that the class of maximum frequency will vary considerably with 
the class interval and with the class limits employed in setting 
up the frequency distribution. To meet this objection a some- 
what different method of locating the mode is sometimes em- 
ployed. It may be described as the method of successive re- 
grouping. This method entails the formulation of a number of 
frequency distributions with different class intervals and class 
limits. An illustration is given below. The frequency distribu-
tion is first set up with 10-point classes, next with 20-point, then
with 30-point, and finally with 50-point. With the 20- and 30- 
point classes, the class limits are shifted one interval (that is, 10 
points) at a time to cover all possible locations of the class limits. 
The mode is finally taken to be that value of the variable which 
appears in the largest number of classes of maximum frequency 
in the several different frequency distributions. The method is 
shown in Table 46 (p. 150). 
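
The method of successive regrouping lends itself to a mechanical statement, sketched below in Python. The function names are hypothetical. The frequencies are the 10-cent groups of Table 46 as far as they can be recovered, from $.25 up to $5.24; the final class is omitted because its entry is not legible. It is also assumed here that the 50-cent groups, like the 20- and 30-cent groups, are shifted one 10-cent interval at a time.

```python
def fullest_group(frequencies, width, shift):
    """Indices of the 10-cent classes forming the group of maximum frequency.

    Groups are `width` classes wide and the first group starts at class `shift`.
    Only complete groups are considered; ties go to the lowest group.
    """
    starts = range(shift, len(frequencies) - width + 1, width)
    best = max(starts, key=lambda s: sum(frequencies[s:s + width]))
    return range(best, best + width)


def mode_by_regrouping(frequencies, widths):
    """Count, for each class, how many modal groups it falls in; return the winner."""
    votes = [0] * len(frequencies)
    for width in widths:
        for shift in range(width):            # every possible location of the limits
            for index in fullest_group(frequencies, width, shift):
                votes[index] += 1
    return votes.index(max(votes)), votes


wage_counts = [1, 15, 59, 85, 157, 113, 169, 201, 304, 685,
               99, 458, 466, 72, 202, 329, 58, 273, 45, 265,
               33, 101, 196, 13, 163, 2, 15, 129, 5, 47,
               12, 0, 221, 5, 16, 11, 0, 82, 0, 3,
               0, 0, 3, 1, 0, 0, 0, 8, 0, 0]

modal_class, votes = mode_by_regrouping(wage_counts, widths=(1, 2, 3, 5))
lower = 0.25 + 0.10 * modal_class
print(f"${lower:.2f} to ${lower + 0.09:.2f}", votes[modal_class])
```

On these figures the class $1.15 to $1.24 falls within the group of maximum frequency in nine of the eleven groupings, more often than any other class.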

Still another method of locating the mode is by a formula 


1 In the frequency distribution given in Table 17 (p. 64) this would be 2.45%.

2 In the distribution just cited above this would be

$1.95 + \frac{17}{36 + 17} = 1.95 + .32 = 2.27\%$

It may be desirable in some cases to let $f_1$ represent the combined frequencies of the two next
lower classes and $f_2$ the combined frequencies of the two next higher classes.
expressing a characteristic relationship existing between the
mode, median, and the arithmetic mean. If the distribution is 
moderately asymmetric, and the arithmetic mean is not seri- 
ously influenced by extreme items, the relation of mode, median, 
and mean is as shown in Chart 34. 


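CHART 34. COMPARISON OF ARITHMETIC MEAN, MEDIAN, AND MODE

[Chart: moderately asymmetric frequency curve with the positions of the mode ($M_o$), median ($M_e$), and arithmetic mean ($M_a$) marked in that order on the horizontal scale.]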


The mode may be stated, therefore, in terms of the other two as
follows: $M_o = M_a - 3(M_a - M_e)$


Application of the formula presupposes the earlier determination 
of the arithmetic mean and median, is permissible only under the 
conditions stated, and yields at best only an approximate figure. 
Consequently, the formula is not commonly employed, save as a 
check on the determination of the different averages by other 
methods. 
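
Used as a check, the relation takes one line of Python. It is assumed here that the formula intended is the familiar empirical relation, mode = mean - 3 (mean - median), which holds only approximately and only for moderately asymmetric distributions; the function name and the figures are hypothetical.

```python
def mode_from_mean_and_median(mean, median):
    """Approximate mode from the empirical relation mode = mean - 3 * (mean - median)."""
    return mean - 3 * (mean - median)


# Hypothetical figures for a moderately asymmetric distribution.
approximate_mode = mode_from_mean_and_median(mean=21.0, median=20.6)
print(round(approximate_mode, 2))
```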

Ideally, as already indicated, the mode is to be determined 
by fitting, through mathematical analysis, a frequency curve to 

1 On an ogive or cumulative frequency curve, the mode is theoretically given by the point
of inflexion in the curve. This does not afford a practicable method of determination of the
mode, however, because of the difficulty of locating exactly the point of inflexion.
TABLE 46. DETERMINATION OF MODE BY METHOD OF SUCCESSIVE REGROUPING
(Data on Wage-Rates from Aldrich Report, after Bowley)

(Each entry in the 20-, 30-, and 50-cent columns is the number of wage earners in the
group that begins with the 10-cent class on the same line.)


                          NUMBER OF WAGE EARNERS

WAGE-RATES            IN 10-CENT   IN 20-CENT   IN 30-CENT   IN 50-CENT
                      GROUPS       GROUPS       GROUPS       GROUPS

From $ .25 to  .34          1           16           75          317
      .35 to  .44          15           74          159          429
      .45 to  .54          59          144          301          583
      .55 to  .64          85          242          355          725
      .65 to  .74         157          270          439          944
      .75 to  .84         113          282          483        1,472
      .85 to  .94         169          370          674        1,458
      .95 to 1.04         201          505        1,190        1,747
     1.05 to 1.14         304          989        1,088        2,012
     1.15 to 1.24         685          784        1,242        1,780
     1.25 to 1.34          99          557        1,023        1,297
     1.35 to 1.44         458          924          996        1,527
     1.45 to 1.54         466          538          740        1,127
     1.55 to 1.64          72          274          603          934
     1.65 to 1.74         202          531          589          907
     1.75 to 1.84         329          387          660          970
     1.85 to 1.94          58          331          376          674
     1.95 to 2.04         273          318          583          717
     2.05 to 2.14          45          310          343          640
     2.15 to 2.24         265          298          399          608
     2.25 to 2.34          33          134          330          506
     2.35 to 2.44         101          297          310          475
     2.45 to 2.54         196          209          372          389
     2.55 to 2.64          13          176          178          322
     2.65 to 2.74         163          165          180          314
     2.75 to 2.84           2           17          146          198
     2.85 to 2.94          15          144          149          208
     2.95 to 3.04         129          134          181          193
     3.05 to 3.14           5           52           64          285
     3.15 to 3.24          47           59           59          285
     3.25 to 3.34          12           12          233          254
     3.35 to 3.44           0          221          226          253
     3.45 to 3.54         221          226          242          253
     3.55 to 3.64           5           21           32          114
     3.65 to 3.74          16           27           27          109
     3.75 to 3.84          11           11           93           96
     3.85 to 3.94           0           82           82           85
     3.95 to 4.04          82           82           85           85
     4.05 to 4.14           0            3            3            6
     4.15 to 4.24           3            3            3            7
     4.25 to 4.34           0            0            3            4
     4.35 to 4.44           0            3            4            4
     4.45 to 4.54           3            4            4            4
     4.55 to 4.64           1            1            1            9
     4.65 to 4.74           0            0            0            8
     4.75 to 4.84           0            0            8            8
     4.85 to 4.94           0            8            8
     4.95 to 5.04           8            8            8
     5.05 to 5.14           0            0
     5.15 to 5.24           0
     5.25 to 5.34     [illegible]


Source: Bowley, Elements of Statistics, p. 97.




the empirical distribution shown by the data. In general this 
method lies beyond the scope of elementary analysis. It is 
sometimes possible, nevertheless, to get a rough approximation 
to the ideal result by the free-hand construction of a simple fre- 
quency curve. The mode then appears as the value of the inde- 
pendent variable corresponding to the maximum ordinate (that 
is, maximum frequency) of the frequency curve. Though this 
method may seem so crude as not to be useful, it commonly 
gives results as satisfactory as those obtained from the other 
methods described above. 


Weighted Averages 


Averages — particularly arithmetic and geometric means — 
are sometimes characterized as either (1) simple or (2) weighted. 
Simple averages are those obtained from values no one of 
which occurs more than once; weighted averages are those 
derived from values some of which enter more than once. 
The values appearing more than once are said to have been 
weighted in the calculation of the average. According to this 
notion, any average derived from a frequency distribution is 
weighted — the mid-values of the several classes influence the 
mean in proportion to their variable frequencies. Commonly, in 
weighting, the factors applied to the different values of the vari- 
able are rough estimates of relative importance; in fact, the sub- 
ject of weighting is significant chiefly in connection with the 
determination of such factors. But the principle remains the 
same: if some values of the variable have a multiple effect in 
the determination of the mean, the mean may be said to be 
weighted. 

The calculation of a weighted average introduces no new prin- 
ciples. Let us consider a simple illustrative case. Suppose 
fresh eggs are quoted in five different markets as follows: 32¢, 
27¢, 30¢, 31¢, 25¢. If these prices are averaged as they stand, the
resulting mean, 29¢, is of the simple type. But suppose that
it appears that the number of dozens of eggs sold at the several
prices in the five markets are 25, 30, 15, 35, 895, respectively.
This is a total sale of 1000 dozen. The average price per dozen
at which these 1000 dozen have been sold clearly is not the simple 
mean of the five quotations, but their weighted mean, determined 
as follows: 


TABLE 47. CALCULATION OF WEIGHTED AVERAGE


(Prices of Eggs, and Quantities Sold)


PRICE PER DOZEN     NO. OF DOZEN SOLD     PRICE TIMES QUANTITY

      32¢                   25                  $  8.00
      27                    30                     8.10
      30                    15                     4.50
      31                    35                    10.85
      25                   895                   223.75

                          1000                  $255.20


The average price is thus $\frac{\$255.20}{1000}$, or 25.52¢, a result very dif-
ferent from that obtained without weighting.
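
The contrast between the simple and the weighted mean in this illustration can be verified with a short Python sketch; the function name is hypothetical, and the prices and quantities are those of the egg example.

```python
def weighted_mean(values, weights):
    """Sum of value times weight, divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


prices = [32, 27, 30, 31, 25]        # cents per dozen in the five markets
dozens = [25, 30, 15, 35, 895]       # dozens sold at each price

simple = sum(prices) / len(prices)
weighted = weighted_mean(prices, dozens)
print(simple, round(weighted, 2))    # 29.0 and 25.52
```

The weighted figure lies close to 25¢ because nearly nine-tenths of the eggs were sold at that price.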

In the averaging of relatives, the question of weighting is 
always to be considered. Associated rates and ratios commonly 
refer to masses of different size. Take, for example, the general 
death rates of the counties of Massachusetts. These cannot 
be directly averaged to obtain a general death rate for the state 
as a whole unless each county rate is weighted proportionately 
to the population of the county. Only so can necessary allow- 
ance be made for wide differences in the populations to which the 
different rates actually apply. The problem of weighting requires 
particular caution in the averaging of relative numbers. 

In later connections, careful attention will have to be given 
to the conditions under which weighting is desirable, and to the 
methods by which it is to be effected for certain specified pur- 
poses. It is enough to recognize at this point, however, that 
weighting does not in general alter the essential differences 
among the averages nor involve in their calculation any steps 
not already considered.
B. THE SELECTION AND USE OF AVERAGES


In the previous section, the five most common averages of 
statistical analysis were briefly explained, and the methods by 
which they are to be obtained, indicated. It is the purpose of 
the present section to consider the merits and defects of each of 
the five averages, and to suggest the conditions under which the 
use of some of the averages seems preferable to the use of others.¹

In the first place, all averages are to be judged with reference 
to certain general criteria. The relative importance of these 
criteria varies from study to study; in a particular investigation, 
some of them may be of negligible significance. There is, never-
theless, a presumption in favor of certain qualities in an average;
and if these common virtues are ignored it must be with good 
reason. 

In general, we may safely cite the following characteristics as
pertaining to a good average:²


1. It should be of definite and certain value, i.e. with any
given set of items there should not be the slightest question
as to the value of the average. This characteristic is some-
times referred to as rigidity of definition.


2. It should be simple and concrete. The reader should not
be expected, because of the complexity of the average, to
take its character on faith.


3. The average should be readily calculable. In estimating
time required for the computation of averages, it is proper
in most instances to assume the employment of various
labor-saving devices, such as tables of logarithms, slide
rules, and calculating machines, since these are ordinarily
available for any considerable statistical investigation.


4. It is commonly desirable that the average lend itself
readily to further algebraic treatment. Determination
of the average is ordinarily but one step in the general
analysis of the variable’s characteristics and relationships.
If the analysis is to be made exhaustive, the average will


1 The use of averages in the construction of index numbers will be discussed in a later
chapter.

2 See Yule, Introduction to the Theory of Statistics, pp. 107-108, and Bowley, Elements of
Statistics, pp. 108-109.
have to be introduced in connection with other phases of
the analysis, and this is difficult unless the averages possess 
definite mathematical form. 


5. The average should possess stability, that is, it should not 
be seriously influenced by small errors in the data, by fluc-
tuations in the items due to sampling, or by slight changes
in the methods of computation.


To what extent does each of the averages possess this combina- 
tion of desirable properties ? 

With regard to rigidity of definition it may be said that all three 
of the calculated means have a marked advantage over the 
median and crude mode. Location of the median, even from a 
full array of original items, rests, whenever there is an even 
number of items, on a mere convention; namely, that the arith- 
metic mean of the two middle items be taken. Happily 
this particular convention has already found general acceptance. 
The conventions involved in interpolating for the median in fre- 
quency distributions, unfortunately, have not yet obtained similar 
universal recognition, with the result that different values for the 
median are sometimes obtained from the same original data. 
The crude mode, as we have seen, is uncertain in location under 
the best of conditions, and commonly is so indefinite as not to 
be usable. We may conclude, therefore, that the three calculated 
means have the good characteristic of being perfectly defined, 
whereas the median is in this respect somewhat defective, and 
the crude mode conspicuously so. 

Simplicity and concreteness of nature are exhibited in largest 
measure by the arithmetic mean, the mode, and the median. As 
stated previously, the arithmetic mean is understood by practi- 
cally all; the nature of the median is fairly easily explained in 
simple terms; and the mode is the average which is commonly 
in mind when one is thinking of the most typical, or common, or 
frequently encountered, value of the variable. All three of these 
averages have a marked advantage over either the geometric or 
harmonic means in the matter of simplicity and concreteness of 
nature. 
Of the five averages, the most easily obtained are the arithmetic
mean and the median. Perhaps the mode qualifies as well when 
it is clearly defined. The geometric and harmonic means, upon 
the other hand, are obtained only after considerably longer 
computations unless special computing devices are employed, 
and even then require somewhat more time for computation 
than the arithmetic mean and the median. In the matter of 
ease of calculation the two latter have a distinct advantage over 
all others.

All three of the calculated means possess definite mathematical 
character and lend themselves, of course, to further algebraic 
treatment. The median and crude mode, on the other hand, 
break down entirely in this respect. Since both are simple means 
of position, they virtually estop any further mathematical analysis 
entailing the use of the average. 

The arithmetic mean is perhaps the most stable of the five 
averages. It is little affected by fluctuations of sampling and 
by the introduction of slight errors, and remains essentially the 
same when the assumptions adopted in the course of its calcula- 
tion are varied in any reasonable fashion. The median also is a 
stable average. 

Review of the several averages in the light of the general cri- 
teria of a good average leads squarely to the conclusion that the 
arithmetic mean qualifies in a much larger number of particulars 
than any other of the five. We may therefore agree with Yule 
(Introduction, p. 123) that the arithmetic mean is the average to 
be employed in statistical analysis unless there are definite and 
sufficient reasons for resorting to the use of some other. 

But it must always be remembered that the most important
general criterion to be applied to an average is fitness for the
particular purpose in hand. To quote Venn (Logic of Chance, p. 439):
“Selection must depend on the question of what particular inter-
mediate value may be safely substituted for the actual variety of
values so far as the precise object in view is concerned.” Accurate
and effective averaging calls for a nice adaptation of means
to ends. It is with regard to the uses to which averages are to be
put, that the characters of the different forms have to be more
carefully scrutinized.

In general, an average may be used in two fundamentally 
different ways: (1) as a substitute for the variable in a mathe- 
matical calculation; (2) as a summary characterization of the 
variable expressing its typical size. The full import of the dif- 
ference between these two uses must be clearly understood. 

Let us consider first the case of an average employed as a 
substitute for the detailed records of the variable in the cal- 
culation of some more general number — say, some significant 
aggregate or product. The problem in this case is to obtain a 
single figure which when substituted for the actual items will give 
the same results as would be obtained if all the detailed individual 
records were used. The average here is essentially a mathemati- 
cal tool; it may properly be called abstract. 

An illustration will serve to make the matter clearer. An 
office of the General Staff is charged with following the work of 
the Army Transport Fleet. It is desirable to know what uniform 
rate of performance by the same number of vessels as are in 
service would effect as large a movement of troops as is being 
actually accomplished by the several transports of the fleet with 
their variable rates of speed. The average “turnaround” of
the vessels is here a substitute for the variable “turnarounds” 
actually recorded. The average is used in place of the variable 
in the calculation of a more generally significant figure. The 
average effects a valuable simplification of the analysis. It stands 
for the variable in the derivation of further statistical results. 

In contrast to the use of averages just cited, averages are 
employed at other times in the measurement of typical size, or 
central tendency, in the variable. If, for example, in the course 
of a laboratory experiment repeated measurements of the length 
of a metal bar under constant conditions show differences, an 
average of the measurements is in a highly significant sense 
typical of the length of the bar; in fact, it is the best estimate 
of the actual length of the bar. Similarly, the average height of 
the members of a college class is the most dependable characterization
of this particular attribute of the group; the average
typifies the group in regard to this feature. In cases of this sort 
the average is no longer a mere mathematical abstraction. Like 
the frequency distribution, it serves to characterize the variable. 
It does so by expressing the variable’s typical size. Averages 
employed for this purpose may be referred to as typical, or concrete, 
averages. 

Though abstract averages are in very common use, they are not
as significant in this part of the present study, concerned as it
is with the analysis of the attributive variable, as are typical
averages. Abstract averages do not
characterize the variable. Typical averages do so in a most 
important way. They represent the type, or central tendency, 
of the variable — one of its most highly important features. 
The remainder of this chapter will deal, therefore, with typical, 
or concrete, averages, intended to measure the typical size of the 
variable. 

The average which corresponds most closely to the concept of 
typical size of item is the mode; in fact, the mode is the typical,
or most common, size of item. Ideally, then, the use of the mode 
is indicated whenever analysis is designed to measure typical 
size of item. Unfortunately, however, as we have seen, the mode 
is in many respects lacking in the desirable properties of an aver- 
age. Consequently, analysis commonly resorts to some average 
other than the mode for the measurement of typical size. 

When the distribution of the variable is essentially symmetrical, 
the mode, median, and arithmetic mean practically coincide. 
Any one of the three, then, may be used in measuring the typical 
size of the item. As we have seen, however, the arithmetic mean 
has substantial advantages over either the median or the mode. 
Consequently, under these circumstances the use of the arithmetic 
mean in the measurement of typical size is to be preferred. 

Suppose, however, the distribution is asymmetrical. Under
this condition, the mode, median, and arithmetic mean part
company, the median staying more closely to the mode than the
arithmetic mean (see Chart 34, p. 149 supra). If asymmetry
becomes marked, the arithmetic mean ceases to express the
typical size of the item; the median serves considerably better
as a measurement of typical size and may be near enough to the
modal value of the variable to stand as an expression of the
central tendency disclosed by the distribution. Asymmetry in
the distribution calls for careful discrimination in the choice of
the average for measurement of typical size of the variable.


1 A note at the end of the chapter offers a brief discussion of abstract averages.

The existence of asymmetry may indicate the use of the geo- 
metric mean in the determination of typical size. If the variable 
is of such a character that large items are more likely to occur 
than small, the use of the geometric mean may give a more satis- 
factory measurement of typical size than would the arithmetic. 
The use of the geometric mean is definitely indicated when
differences in the size of the variable are to be regarded
proportionately rather than absolutely. The price relatives
presented in the table on page 68 are a case in point. A geo- 
metric mean of these price relatives gives their typical size more 
satisfactorily than would any other average. 

The existence of extreme items (that is, items extremely large 
or extremely small compared with the remainder of the distribu- 
tion), like asymmetry, influences the choice of the average in the 
measurement of typical size. All of the calculated means are 
unavoidably influenced by such extreme items; they may be so 
distorted as no longer to represent the variable satisfactorily. 
Thus, the presence of a millionaire in a small community affects 
any calculated mean of income in the community so as to make 
the mean unrepresentative of the typical size of income. Of 
course, the objectives of analysis may be such that full influence 
is to be given to extreme items of this sort; but commonly this 
is not the case. When it is not desirable to allow extreme items 
full influence, the use of the median or mode is indicated. These, 
as averages of position, are unaffected by the mere size of extreme 
items. 
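
The effect of an extreme item on a calculated mean, as against an average of position, can be seen in a small numerical sketch. The incomes below are hypothetical figures chosen only for illustration; the calculation relies on Python's standard `statistics` module.

```python
from statistics import mean, median

# Hypothetical annual incomes (dollars) in a small community.
incomes = [1200, 1500, 1800, 2000, 2400]
# The same community after a millionaire is added.
with_millionaire = incomes + [1_000_000]

mean_before, median_before = mean(incomes), median(incomes)
mean_after, median_after = mean(with_millionaire), median(with_millionaire)

print(f"mean:   {mean_before} -> {mean_after}")
print(f"median: {median_before} -> {median_after}")
```

With these assumed figures the arithmetic mean rises nearly a hundredfold, while the median moves by only 100 dollars.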


1 It might be added that the geometric mean, as contrasted with the arithmetic, reduces
the influence of extremely large items and magnifies the influence of extremely small ones.

Another situation may suggest the use of an average of position;
namely, a situation in which definite information regarding
some values of the variable is lacking. Thus, the several values 
of the variable may be merely ranked, that is, not given quanti- 
tatively; or there may be unlimited terminal classes in the dis- 
tribution so that the values of the extreme items cannot be told 
even approximately. The median is undisturbed by these con- 
ditions, and may well be employed under such conditions in 
preference to any of the calculated averages.

The different cases specifically considered cover most of the 
circumstances under which the use of particular averages is 
explicitly suggested. In the absence of such definite indications, 
it is generally safe to fall back on the presumption in favor of 
the arithmetic mean. This, upon the whole, is the most service- 
able average. In general, the importance of discrimination and 
good judgment in the use of averages can hardly be exaggerated. 
Averages constitute one of the commonest devices of statistical 
analysis. Furthermore, they are one of the most effective means 
of extracting significant conclusions from the original observations 
of the variable. Their use with discrimination should be con- 
stantly cultivated. 

It must always be remembered, however, that averages are 
easily subject to misleading use, if not actual abuse, and should, 
as a rule, be accompanied by supporting data. An abstract 
average may be set up as an accurate representation of the central 
tendency of a variable when, as a matter of fact, it does not ex- 
press the typical size at all, but serves as a pure mathematical 
abstract of the data. Examination of the original data will 
quickly expose such a situation. Typical, or concrete, averages 
are invariably to be checked against the frequency distributions 
from which they have been drawn, that the representativeness 
of the average in every case may be verified. 

Even when full representativeness is assured, the average 
tells only a partial story. No average gives any account of the 
details of variation. These are to be found in the original data, 
or in the array, or in the frequency distribution. Furthermore,
no average expresses any general characteristic of the variable
other than that of typical size. Other characteristics have to be 
brought to light by other statistical devices. The average is a 
highly important summary figure, perhaps the most important 
single summary figure, but it rarely, if ever, serves by itself 
adequately to characterize the variable. 


Note on the Use of Abstract Averages 


Three different conditions may be distinguished in the use of 
abstract averages. These will be examined in turn. 

In the first place, a substitutional or abstract average may be 
desired when the sum of the items is the more inclusive figure to
which interest attaches. Thus, in a campaign for the solicitation 
of funds, it may be of interest to know what average contribution 
would have yielded the amount which has been actually collected. 
The idea here is not to obtain the typical size of contribution but 
merely that contribution which, multiplied by the total number 
of contributions, will produce the same total sum. The abstract 
average to employ in this case obviously is the arithmetic mean. 
The arithmetic mean of several items multiplied by the number 
of items gives precisely the same total as is obtained by summing 
all the individual items. The use of the arithmetic mean in this 
fashion is so common that further comment is hardly required on 
this use of the abstract average. 

An abstract average may be employed, in the second place, in 
lieu of the variable items in the calculation of a product derived 
from the successive multiplication of the items of the variable. 
Thus, trust funds may have been allowed to accumulate over a 
period of twenty years, cumulation being at variable rates in the 
several different years of the period. The rates of the successive 
periods are really compounded, of course, in the final sum. A 
question may be raised concerning the average rate of return 
obtained in the trustees’ investment of the funds. A constant 
rate of return is to be found which, if prevailing over the twenty- 
year period, would have yielded the same final fund. In this 
instance, the average to be taken is the geometric mean. The
geometric mean of the several items, raised to a power corresponding
to the number of items, gives the same product as is obtained
from successive multiplication of the individual items themselves. 
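
The substitution property of the geometric mean can be checked numerically. The sketch below assumes five hypothetical yearly rates of return (the text speaks of twenty years; five are used only for brevity) and treats each year's accumulation factor, one plus the rate, as the item to be averaged.

```python
import math

# Hypothetical yearly rates of return, stated as decimals.
rates = [0.04, 0.06, 0.03, 0.05, 0.07]
factors = [1 + r for r in rates]          # yearly accumulation factors
n = len(factors)

# Final value of one unit of principal compounded at the variable rates.
final_factor = math.prod(factors)

# Geometric mean of the factors: the n-th root of their product.
geometric_mean_factor = final_factor ** (1 / n)
average_rate = geometric_mean_factor - 1

# For comparison, the arithmetic mean of the rates.
arithmetic_rate = sum(rates) / n

print(f"constant rate giving the same final fund: {average_rate:.4%}")
print(f"arithmetic mean of the rates:             {arithmetic_rate:.4%}")
```

Compounding at the geometric-mean rate reproduces the final fund, whereas compounding at the arithmetic mean of these assumed rates would overstate it slightly.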

A third case calling for specific consideration is represented in 
the example of the “turnarounds” of the army transports. If
interest attaches to the accomplishment of a given amount of 
work, and the rates of performance at which the work is done by 
several individuals are stated in terms of the time required for the 
accomplishment of the specified task, the average rate — so 
stated — at which the task must be accomplished in order to 
assure the performance of a given aggregate of work, will be the 
harmonic mean of the rates as given. This follows from the fact 
that the desired figure is dependent upon an average of the number 
of completed tasks which each of the several variable performers
succeeds in doing in any stated period of time. This number of
completed tasks is reciprocally related to the time required by 
each for the completion of the task. 

The case may be illustrated concretely. Suppose three vessels
are plying over a certain overseas route. Vessel A makes the
round trip in 25 days, vessel B in the same time, and vessel C in
50 days. What average “turnaround” (meaning by turnaround
the time occupied to complete a trip) will give the same number
of trips in 100 days as would be made by the three vessels traveling
at their respective speeds? It might be thought that the
answer would be the arithmetic average of 25, 25, and 50, or 33⅓
days. But with this turnaround, each vessel would make three
trips during 100 days and the fleet as a whole, 9 trips during the
100 days, whereas it is clear that A and B can each make four round
trips in 100 days, and C two such trips, a total of 10 completed
voyages. In fact, then, each vessel must make, on the average,
3⅓ trips, or operate with a turnaround of 30 days (the harmonic
mean of 25, 25, and 50) if the three together are to complete as
many trips as they will with their different actual speeds.
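
The arithmetic of the three vessels can be verified in a few lines. This is only a sketch of the illustration above; the helper `fleet_trips` is a name introduced here for convenience, and `harmonic_mean` comes from Python's standard `statistics` module.

```python
import math
from statistics import harmonic_mean, mean

turnarounds = [25, 25, 50]   # days per round trip for vessels A, B, C
period = 100                 # days under observation

# Trips actually completed at the vessels' own speeds: 4 + 4 + 2.
actual_trips = sum(period / t for t in turnarounds)

def fleet_trips(uniform_turnaround):
    """Trips the fleet would complete if every vessel had the same turnaround."""
    return len(turnarounds) * period / uniform_turnaround

trips_with_arithmetic = fleet_trips(mean(turnarounds))
trips_with_harmonic = fleet_trips(harmonic_mean(turnarounds))

print(f"actual trips:              {actual_trips:.1f}")
print(f"using the arithmetic mean: {trips_with_arithmetic:.1f}")
print(f"using the harmonic mean:   {trips_with_harmonic:.1f}")
```

Only the harmonic mean, substituted for the three actual turnarounds, reproduces the ten completed voyages.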

The harmonic mean is always required in the averaging of 
variable rates of performance, when these rates are stated in 
terms of the time required for the accomplishment of a given
task and the object is to obtain an average rate which assures the
same aggregate accomplishment in a given period of time. The 
total accomplishment of each individual during any stated period 
is reciprocally related to the time required for the completion of 
the given task. Whenever this reciprocal relationship is involved 
in the problem and an abstract average is being taken, the har- 
monic mean should be employed. 


CHAPTER XI


DISPERSION 


As indicated at length in the previous chapter, averages characterize
the variable by expressing its typical size or “central
tendency.” In this rôle, the average serves as the most concentrated
description of the variable; no other single figure so
significantly summarizes the variable. But it should always be 
remembered that the average carries the process of summariza- 
tion so far that all details of the image of the variable except its 
typical size are lost, and that, consequently, it is well ordinarily 
to supplement the average with additional statistical devices 
designed to give information regarding the nature of the variable 
which no average by itself can convey. 

The desirability of supporting the average with certain other 
summary figures becomes evident if we consider the situation 
represented by the two frequency distributions shown in Chart 35. 
Averages drawn from the two distributions of this figure would be 
identical whether the average were computed as an arithmetic 
mean, a median, or a mode. Yet it is clear that the two distributions
are in at least one respect very different: they differ
decidedly in their spread. No statement of the average size of a 
variable conveys any idea regarding the extent or the nature of 
the scattering of the items. Certain supplementary figures are 
required if the variable is to be characterized completely. 
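
A pair of small arrays makes the point concrete. The figures are hypothetical, chosen so that the three averages coincide; the range is printed only as the crudest indication that the scatter differs.

```python
from statistics import mean, median, mode

# Two hypothetical sets of items with identical averages.
narrow = [4, 5, 5, 5, 6]
wide = [1, 3, 5, 5, 5, 7, 9]

for name, items in (("narrow", narrow), ("wide", wide)):
    spread = max(items) - min(items)
    print(name, mean(items), median(items), mode(items), spread)
```

Both arrays yield 5 for the mean, the median, and the mode, yet the second is scattered over four times the interval of the first.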


A. VARIABILITY, OR THE EXTENT OF DISPERSION 


Among these supplementary figures perhaps the most impor- 
tant is the measure of variability. A measure of variability or 
dispersion is designed to state the extent to which the individual
items of a frequency distribution differ on the average from the
mean in terms of which the most characteristic size of the variable 
has been stated. In other words, a measure of dispersion 
indicates how typical the average is of the distribution which it 
has been employed to summarize. In the case of repeated 
measurements of the same object, the measure of dispersion 
ordinarily indicates the inaccuracy of observation. In other 
cases, the dispersion of the individual items reflects the vari- 
ability of the observed attribute. 


CHART 35. FREQUENCY DISTRIBUTIONS OF IDENTICAL MEANS BUT
UNEQUAL DISPERSION


Dispersion may be measured by a number of different statistical
devices. Those in most common use are (1) the average deviation;
(2) the standard deviation; (3) the quartile deviation;
(4) deciles and percentiles.1 Following the procedure adopted in
the examination of averages, the nature of each of these measures
of dispersion will first be considered, and then the methods by
which the measure is to be obtained. The advantages and
disadvantages of each, and the circumstances under which the use
of each measure is appropriate, will finally be analyzed.2


1 Sometimes the range of the variable (namely, the difference between the sizes of the
largest and smallest items) is regarded as a crude measure of dispersion. As a matter of
fact, however, the range as a measure of dispersion has no merit save its simplicity. It is
determined entirely by the position of the two terminal items and consequently does not
reflect at all the character of the distribution for that portion of the scale of the variable
within which the bulk of the items fall. Furthermore, since the terminal items are from the
very nature of the case likely to be stray items, their values are almost invariably erratic.
The result is that the range itself becomes uncertain and erratic. In short, the range is
not to be seriously regarded as a measure of dispersion.

2 A simple set of symbols will be used to denote the different deviation measures:
    Average deviation = $D_a$
    Standard deviation = $\sigma$
    Quartile deviation = $D_q$


Average Deviation


The average deviation is in some ways the most obviously
logical of all the measures of dispersion. After all, the problem
of obtaining a measure of dispersion is a problem of extracting a
satisfactory average from a set of derived items expressing the
deviation of each original item from the mean of the distribution.3
The deviations of the items may be measured from the arithmetic
mean, the median, or the mode, though theoretically, there is
something to be said for taking the deviations always from the
median in the calculation of the average deviation.4 The average
deviation is ordinarily the simple arithmetic mean of the deviation
items, signs being ignored; in symbolic form,

$$D_a = \frac{\sum |x|}{N}$$

The computation of the average deviation may be fairly easily
carried through from the actual deviation items. The steps
involved in such a computation are sufficiently indicated in the
details of Table 48.


3 In the notation previously adopted, the deviation items run as follows: $X_1 - M$,
$X_2 - M$, ..., $X_n - M$, where M is an average of all the X's. The deviation items may be
conveniently represented by $x_1, x_2, \ldots, x_n$, where $x_1 = X_1 - M$, $x_2 = X_2 - M$, etc. The
values given by these expressions obviously will be in terms of the same unit as was
employed in stating the values of the original items. Some of the deviation items clearly will be
positive, others negative. If M is the arithmetic mean (in other words, if deviations have
been measured from the arithmetic mean) the algebraic sum of all the deviation items
($\sum x$) will be zero, the sum of the negative items just offsetting the sum of the positive.

4 This theoretical point derives from the fact that the absolute sum of the deviations from
the median is a minimum.


TABLE 48. CALCULATION OF AVERAGE DEVIATION FROM FREQUENCY
DISTRIBUTION, USING DEVIATIONS FROM MEDIAN

(Percentage of Members Unemployed in 120 Trade Unions)

Percentage     Frequency    Deviation from    Deviation times frequency
unemployed        (f)          median             (signs ignored)

 0.0-0.9            5           -2.0                    10
 1.0-1.9           36           -1.0                    36
 2.0-2.9           37            0.0                     0
 3.0-3.9           17            1.0                    17
 4.0-4.9           10            2.0                    20
 5.0-5.9            7            3.0                    21
 6.0-6.9            5            4.0                    20
 7.0-7.9            1            5.0                     5
 8.0-8.9            1            6.0                     6
 9.0-9.9            1            7.0                     7

                  120                                  142

$D_a = \frac{142}{120} = 1.18$, or 1.2%
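
The steps of Table 48 can be retraced as a short calculation. The sketch below takes the class frequencies and the deviations of the classes from the median class as they stand in the table; the variable names are introduced here for illustration.

```python
# Class frequencies of Table 48, for the classes 0.0-0.9 through 9.0-9.9 per cent.
frequencies = [5, 36, 37, 17, 10, 7, 5, 1, 1, 1]
# Deviation of each class from the median class (2.0-2.9), in percentage points.
deviations = [-2, -1, 0, 1, 2, 3, 4, 5, 6, 7]

n = sum(frequencies)

# Sum of deviation times frequency, signs ignored.
total_deviation = sum(f * abs(d) for f, d in zip(frequencies, deviations))

average_deviation = total_deviation / n
print(n, total_deviation, round(average_deviation, 2))
```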


The computation of the average deviation may ordinarily be
slightly simplified by using as an arbitrary origin the mid-value
of the class in which the median lies.1 In this case, a correction
of the initial sum of the deviations is necessary in order to allow
for the fact that the deviations have not been taken from the
median. Examination of the difference between the median and
the arbitrary origin (a difference which may be represented by
d) will always make clear which deviation items have been taken
too small and which too large. If the number of deviation items
which have been taken too small is represented by $N_1$, and the
number which have been taken too large by $N_2$, the necessary
correction of the initial sum of the deviations is given by the
expression $d(N_1 - N_2)$, where d is the absolute difference between
$M_e$ and A. Care must be exercised to obtain the proper values
for $N_1$ and $N_2$, but with these in hand, the correction of the first
results is a simple matter. The complete computation is shown
in connection with Table 49. It can readily be seen that
computation by this method will ordinarily be much simpler than
by that worked in terms of deviations from the median.


1 In the example just given the actual median happens to lie at the mid-value.


TABLE 49. CALCULATION OF AVERAGE DEVIATION FROM FREQUENCY
DISTRIBUTION, USING DEVIATIONS FROM ARBITRARY ORIGIN

(Percentage of Freight Traffic to Total Traffic on American Railroads)

Mid-value of    Number of    Deviation from arbitrary    Deviation times
percentage      railroads    origin A (= 75%),           frequency
class                        in class intervals          (signs ignored)

    45              1               -6                        6
    50              3               -5                       15
    55              3               -4                       12
    60              6               -3                       18
    65             12               -2                       24
    70             23               -1                       23
    75             31                0                        0
    80             18                1                       18
    85             18                2                       36
    90             10                3                       30
    95              6                4                       24

                  131                                       206

Total deviations from A: in intervals = 206
                         in original units = 206 × 5 = 1030
Correction for individual item, $d = M_e - A = 75.3 - 75.0 = .3$
Total correction = $.3(N_1 - N_2)$, where $N_1$ is the number of items for
    which the deviation has been taken too small = 79
    and $N_2$ is the number of items for which the deviation has been taken
    too large = 52
$d(N_1 - N_2) = .3(79 - 52) = 8.1$
Total deviations from $M_e$ = 1030 + 8.1 = 1038.1

$D_a = \frac{1038.1}{131} = 7.9\%$
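
The correction can be checked against a direct summation of deviations from the median. The sketch below uses the mid-values and frequencies of the railroad distribution and the median of 75.3 stated in the text; it assumes, as in this example, that the median lies above the arbitrary origin, so that every item at or below the origin has had its deviation taken too small.

```python
import math

mid_values = list(range(45, 100, 5))                  # 45, 50, ..., 95
frequencies = [1, 3, 3, 6, 12, 23, 31, 18, 18, 10, 6]
origin = 75.0    # arbitrary origin A
median = 75.3    # median as stated in the text

pairs = list(zip(mid_values, frequencies))
n = sum(frequencies)

# Sum of deviations from the arbitrary origin, signs ignored.
total_from_origin = sum(f * abs(m - origin) for m, f in pairs)

# Correction d(N1 - N2); valid as written only when median > origin.
d = median - origin
n_too_small = sum(f for m, f in pairs if m <= origin)
n_too_large = sum(f for m, f in pairs if m > origin)
total_from_median = total_from_origin + d * (n_too_small - n_too_large)

# Direct computation from the median, for comparison.
direct_total = sum(f * abs(m - median) for m, f in pairs)

average_deviation = total_from_median / n
print(round(total_from_median, 1), round(average_deviation, 1))
```

The corrected total agrees with the direct summation, which is the sense in which the arbitrary origin merely rearranges the arithmetic.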


Standard Deviation 


The standard deviation, the second of the measures of dispersion,
is to be regarded as another mean of deviation items. In
computing the average deviation, differences of signs in the
deviation items are ignored. From a mathematical point of view, this
high-handed treatment of the signs is quite objectionable,
destroying the mathematical integrity of the resultant figures. Yet
ignoring the signs is unavoidable as long as an average is sought
of the deviation items in their first powers, since, if signs are
recognized, the negative and positive deviation items tend to
cancel one another and leave a negligible result. This difficulty
may be overcome by first squaring the deviation items, in which
case all become positive as to sign, then obtaining the mean of
these deviation items squared, and finally taking the square root
of this result. In algebraic notation, the standard deviation,
ordinarily represented by $\sigma$, is $\sqrt{\frac{\sum x^2}{N}}$.1 In obtaining $\sigma$,
deviations are always taken from the arithmetic mean. Thus set up,
the standard deviation is a form of average of the deviation items.2

Computation of the standard deviation may be carried through
in terms of the squares of the true deviation items. This direct
mode of calculation is illustrated in Table 50.

TABLE 50. CALCULATION OF STANDARD DEVIATION FROM FREQUENCY
DISTRIBUTION: LONG METHOD

(Percentage of Freight Traffic to Total Traffic on American Railroads)

Mid-value of    Number of    Deviation    Deviation    Deviation squared
percentage      railroads    from mean    squared      times frequency
class             (f)          (x)         (x^2)           (fx^2)

    45              1         -30.4        924.16           924.16
    50              3         -25.4        645.16          1935.48
    55              3         -20.4        416.16          1248.48
    60              6         -15.4        237.16          1422.96
    65             12         -10.4        108.16          1297.92
    70             23          -5.4         29.16           670.68
    75             31          -0.4           .16             4.96
    80             18          +4.6         21.16           380.88
    85             18          +9.6         92.16          1658.88
    90             10         +14.6        213.16          2131.60
    95              6         +19.6        384.16          2304.96

                  131                                     13980.96

$\sigma = \sqrt{\frac{13980.96}{131}} = \sqrt{106.72} = 10.3\%$


1 Examination of the formula will indicate why the standard deviation is sometimes 
referred to as the root-mean-square deviation. 

2 Under conditions of a fairly normal distribution of the original values of the variable,
the average deviation will commonly be about four fifths of the standard. Six times $\sigma$ will
cover 99+ per cent of all the observations in a symmetrical or moderately asymmetrical
distribution.

A much simpler method of calculating the standard deviation,
however, is available. An arbitrary origin is employed. If the
mean-square of the deviations from this arbitrary origin is
computed, and from this result the square of the difference between
the arbitrary origin and the mean subtracted, the result is the
square of the true standard deviation.1 The computation may
be simplified by using the class interval as the unit; it may be
checked by recomputing $\sigma$ from a different arbitrary origin.
The method is illustrated in Table 51.


TABLE 51. CALCULATION OF STANDARD DEVIATION FROM FREQUENCY
DISTRIBUTION: SHORT-CUT METHOD

(Percentage of Freight Traffic to Total Traffic on American Railroads)

Mid-value of    Number of    Deviation from        Deviation times    Squared deviation
percentage      railroads    arbitrary origin      frequency          times frequency
class             (f)        (x', in intervals)    (fx')              (fx'^2)

    45              1              -6                  -6                  36
    50              3              -5                 -15                  75
    55              3              -4                 -12                  48
    60              6              -3                 -18                  54
    65             12              -2                 -24                  48
    70             23              -1                 -23                  23
    75             31               0                   0                   0
    80             18              +1                 +18                  18
    85             18              +2                 +36                  72
    90             10              +3                 +30                  90
    95              6              +4                 +24                  96

                  131                                 +10                 560

Sum of negative items = -98; sum of positive items = +108.

$c = M_a - A = \frac{\sum f x'}{N} = \frac{10}{131} = .0763$ (interval)
$c^2 = .0058$ (interval)
$S^2 = \frac{\sum f x'^2}{N} = \frac{560}{131} = 4.275$ (intervals)
$\sigma^2 = S^2 - c^2 = 4.275 - .0058 = 4.269$ (intervals)
$\sigma = \sqrt{4.269} = 2.06$ (intervals)
$= 2.06 \times 5\% = 10.3\%$
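
The agreement between the long method and the short-cut method can be confirmed by carrying out both on the same distribution. The sketch below uses the mid-values and frequencies of the railroad data; the variable names are illustrative, and the mean is computed exactly rather than rounded to 75.4.

```python
import math

mid_values = list(range(45, 100, 5))                  # 45, 50, ..., 95
frequencies = [1, 3, 3, 6, 12, 23, 31, 18, 18, 10, 6]
origin, width = 75, 5        # arbitrary origin A and class interval
n = sum(frequencies)

# Short-cut method: deviations from A measured in class intervals.
steps = [(m - origin) // width for m in mid_values]   # -6, -5, ..., +4
sum_fx = sum(f * s for f, s in zip(frequencies, steps))
sum_fx2 = sum(f * s * s for f, s in zip(frequencies, steps))
c = sum_fx / n               # mean minus origin, in intervals
s_squared = sum_fx2 / n      # mean-square deviation from origin
sigma_short = width * math.sqrt(s_squared - c ** 2)

# Long method: deviations taken from the arithmetic mean itself.
mean_value = sum(f * m for f, m in zip(frequencies, mid_values)) / n
sigma_long = math.sqrt(
    sum(f * (m - mean_value) ** 2 for f, m in zip(frequencies, mid_values)) / n
)

print(round(sigma_short, 1), round(sigma_long, 1))
```

Both routes give a standard deviation of about 10.3 per cent, the short-cut requiring only small whole-number deviations.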


1 Algebraically the relationship is given as follows:

$$\sigma^2 = S^2 - c^2$$

where $S^2 = \frac{\sum (x')^2}{N}$ and $c = M_a - A$,

$x'_1, x'_2, x'_3, \ldots, x'_n$ being the deviations of $X_1, X_2, X_3, \ldots, X_n$ from the arbitrary origin, A.


Quartile Deviation


The quartile deviation is a measure of a different sort, resting 
not upon a computation from the deviation items, but rather 
upon a determination of the position of two designated items in 
the frequency distribution. These items are the first and third 
quartiles. 

The nature of the quartiles requires brief explanation. They 
are closely related to the median. Just as the median was 
defined as that position on the scale of the variable which divides 
the items arranged in the order of magnitude into two equal parts, 
the quartiles are to be thought of as those positions on the scale 
which divide the items arranged in the order of magnitude into 
four equal parts. There are obviously three of these positions or 
quartiles. The one toward the lower values of the scale of the 
variable is referred to as the first, the one toward the upper range 
of the variable as the third. The second, or middle, quartile 
clearly is the median. 

The deviations of the first and third quartiles from the median
afford simple evidence of the extent of dispersion in the
distribution. The greater these deviations, the greater the dispersion in
the distribution. Clearly, however, there is no reason for taking
as a simple measure of dispersion the difference between the
median and the first quartile in preference to the difference
between the third quartile and the median. An average of the
two consequently is struck. If the first quartile is represented
by $Q_1$, the median by $M_e$, and the third quartile by $Q_3$, the
quartile measure of deviation is given as follows:

$$D_q = \frac{(Q_3 - M_e) + (M_e - Q_1)}{2} = \frac{Q_3 - Q_1}{2}$$

Examination of this expression will quickly show the reason for
the technical name which is commonly given the quartile measure,
namely, the semi-interquartile range.1


1 The quartile deviation will ordinarily be about two thirds of the standard deviation if 
the distribution from which it is obtained is symmetrical or only moderately asymmetrical. 
The mean deviation under the same conditions will be about four fifths of the standard 
deviation.

Calculation of the quartile deviation involves practically
nothing but the location of the quartiles. This is accomplished 
by methods like those employed previously in computing the 
median. The only changes necessary in the formula are the 
substitution of N/4 for the first quartile, and 3 N/4 for the third 
quartile, in place of the N/2 employed in locating the median. 
The methods will be clear after examination of the illustrative 
case given in Table 52 and Chart 36. 


TABLE 52. LOCATION OF QUARTILES IN FREQUENCY DISTRIBUTION

(Percentage of Freight Traffic to Total Traffic on 131 American
Railroads in 1920)

Mid-value of        Number of
percentage class    railroads

      45                1
      50                3
      55                3
      60                6
      65               12
      70               23
      75               31
      80               18
      85               18
      90               10
      95                6

                      131

$\frac{N}{4} = \frac{131}{4} = 32.75$;  $\frac{3N}{4} = \frac{393}{4} = 98.25$
$Q_1 = 67.5 + 7.75\left(\frac{5}{23}\right) = 69.2$
$Q_3 = 82.5 + 1.25\left(\frac{5}{18}\right) = 82.8$

$D_q = \frac{Q_3 - Q_1}{2} = \frac{82.8 - 69.2}{2} = 6.8\%$
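
The interpolation used to locate the quartiles can be written as one small routine. This is a sketch, not the author's own procedure: `grouped_quantile` is a name introduced here, and the class limits are assumed to lie midway between adjacent mid-values (42.5 to 47.5 for the first class, and so on), which agrees with the limits 67.5 and 82.5 used in the computation above.

```python
def grouped_quantile(lower_limits, frequencies, width, fraction):
    """Locate the point below which `fraction` of the items fall,
    interpolating linearly within the class that contains it."""
    target = fraction * sum(frequencies)
    cumulative = 0
    for lower, f in zip(lower_limits, frequencies):
        if f > 0 and cumulative + f >= target:
            return lower + (target - cumulative) / f * width
        cumulative += f
    raise ValueError("fraction must lie between 0 and 1")

width = 5
lower_limits = [m - width / 2 for m in range(45, 100, 5)]   # 42.5, 47.5, ...
frequencies = [1, 3, 3, 6, 12, 23, 31, 18, 18, 10, 6]

q1 = grouped_quantile(lower_limits, frequencies, width, 0.25)
median = grouped_quantile(lower_limits, frequencies, width, 0.50)
q3 = grouped_quantile(lower_limits, frequencies, width, 0.75)
quartile_deviation = (q3 - q1) / 2

print(round(q1, 1), round(median, 1), round(q3, 1), round(quartile_deviation, 1))
```

The same routine called with a fraction of one half returns the median, which shows that the quartiles differ from the median only in the fraction of N that is counted off.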


Deciles and Percentiles 


Deciles are those positions on the scale of the variable which 
divide the distribution into ten equal parts. Clearly there are 
nine such values in each distribution. Deciles measure dis- 
persion in much the same way that it is measured by the quartiles, 
giving the picture, however, in greater detail. No distinct 
decile measure of dispersion is recognized — rather the deciles 
themselves are cited in full in evidence of the extent and character 
of the dispersion.

Percentiles subdivide the array into one hundred equal parts.
Thus they carry the principle of quartiles and deciles to even 
greater detail. However, it is rarely, if ever, necessary to have 
more detail than is given by the deciles, and the use of percentiles 
consequently is altogether exceptional in the statistical analysis 
of economic and business data. 
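
For a full array of items the deciles may be marked off mechanically. The sketch below uses a hypothetical array of twenty observations and the `quantiles` function of Python's standard `statistics` module (available from Python 3.8); its default interpolation rule is an assumption of this example and need not coincide exactly with the counting rule used in the text.

```python
import math
from statistics import median, quantiles

# Hypothetical array of twenty observations, arranged in order of magnitude.
observations = [2, 3, 5, 6, 7, 8, 9, 10, 11, 12,
                13, 14, 15, 17, 18, 20, 22, 25, 29, 36]

deciles = quantiles(observations, n=10)      # nine points of division
quartiles = quantiles(observations, n=4)     # three points of division

print("deciles:  ", deciles)
print("quartiles:", quartiles)
```

The fifth decile and the second quartile both fall at the median of the array, as the definitions require.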


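CHART 36. LOCATION OF QUARTILES FROM CUMULATIVE FREQUENCY CURVE

(Data of Table 52)

[Cumulative frequency curve: number of railroads (vertical scale) against
percentage of freight traffic (horizontal scale), with $Q_1$ = 69.2,
$M_e$ = 75.3, and $Q_3$ = 82.8 marked on the horizontal scale.]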

Deciles and percentiles are computed by the means already 
explained for the median and quartiles. They may be marked 
off readily in any full array of items, may be interpolated in the
classes of frequency distributions by formulae devised on the
lines of the formula given in Chapter X for the median, or may 
be readily obtained from the cumulative frequency graph. 
Since, in the measurement of dispersion, neither deciles nor per- 
centiles differ in principle from quartiles, they need not be further 
considered in the discussion of the advantages and disadvantages 
of the several measures.


The Selection and Use of Indices of Dispersion


The several measures of dispersion may be tested by the same 
criteria as were applied in the examination of averages.1 It is
desirable that a measure of dispersion be rigidly defined, simple 
and not unnecessarily abstract, easily obtainable, of such char- 
acter as to lend itself to further algebraic treatment, and rela- 
tively stable under the conditions of sampling. Judged by these 
tests, what is to be said of the relative merit of the three com- 
monly employed measures of dispersion: the average, or mean, 
deviation, standard deviation, and quartile deviation? 

The mean deviation has the advantage of possessing perhaps 
the simplest nature of all. From a logical point of view, it seems 
to be the most obvious characterization of the typical deviation 
of items. It is rigidly defined and is computed with comparative 
ease. Its one serious defect is that mathematically it does vio- 
lence to the character of the deviation items in that differences 
of sign are completely ignored. Because of this fact, use of the 
average deviation estops all subsequent mathematical treatment. 
Were it not for this difficulty, the average deviation might be 
more widely employed; as it is, it is distinctly inferior to the 
standard deviation. 

The latter is somewhat more abstract than the mean devia- 
tion, not so readily understood by those unacquainted with 
mathematical technique, and not quite as easily computed as the 
mean deviation. Nevertheless, it meets most of the tests of a 
satisfactory statistical average. It is rigidly defined, calculated 
with comparative ease, relatively stable, and lends itself to further 
mathematical treatment. There would seem to be, therefore, 
good reason for subscribing to Yule’s dictum with reference to 
the standard deviation: “The standard deviation should always 
be used as the measure of dispersion, unless there is some very 
definite reason for preferring another measure.”²

Though this rule would seem to govern the choice of a measure 
of dispersion, there are times when the quartile measure may be 


1 See pages 153-154. 
2 See Introduction to the Theory of Statistics, p. 144.

satisfactorily employed. The quartile measure is, after all, a
simple device, readily understood, and obtained without serious 
difficulty. It has the fundamental limitation of all the position 
measures in that it does not lend itself to further mathematical 
treatment. Furthermore, it is at times, like the median, some- 
what lacking in rigidity of definition, this being particularly the 
case with discrete variables. Upon the whole, the use of the 
quartile measure is hardly to be recommended save when the 
median is employed as the measure of typical size, in which 
case the quartiles would seem to be logically indicated in the 
measurement of dispersion. 

All three of the measures of dispersion, in the forms thus far 
considered, are absolute numbers carrying the same unit as the 
original items from which the measures have been obtained. 
Under most, though not all, conditions, however, dispersion is 
to be regarded relatively, not absolutely. The base to which 
the absolute amount of dispersion is to be referred is presumably 
the typical size of the items in the distribution. Deviation 
measures may be divided, in other words, by corresponding aver- 
ages. Results obtained in this way may be referred to as coef- 
ficients of dispersion, in contrast to measures of dispersion. In 
deriving the coefficients from the measures, the measures should 
be divided by the means from which the deviations have been calculated.
Thus the mean deviation divided by the average from which it is
measured may be regarded as the best mean coefficient of dispersion;
$\frac{\sigma}{M_a}$ as the best standard coefficient of dispersion;¹ and

$$\frac{(Q_3 - Q_1)/2}{(Q_3 + Q_1)/2} = \frac{Q_3 - Q_1}{Q_3 + Q_1}$$

as the best quartile coefficient of dispersion.


1 Karl Pearson has suggested one hundred times the standard deviation divided by the
mean (that is, $\frac{100\sigma}{M_a}$) as a “coefficient of variation.”

The relationship of the various measures and coefficients to
one another may be shown by giving in parallel columns the 
absolute and relative dispersion values for the data of Table 52. 
This is done in Table 53. 


TABLE 53. COMPARISON OF MEASURES AND COEFFICIENTS OF DISPERSION 


(Data of Table 52) 


TYPE OF MEASURE | VALUE OF MEASURE | VALUE OF COEFFICIENT
Mean deviation | 7.9% | .10
Standard deviation ($\sigma$) | 10.3 | .14
Quartile deviation | 6.8 | .09


Whether measured absolutely or relatively, the extent of the 
dispersion of the variable constitutes an important part of its 
characterization. An accurate measurement of variability sup- 
plements the earlier determination of typical size. The two 
together cover the most important summary features of the 
variable. 
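
The passage from measures to coefficients of dispersion may be illustrated by a short computation. The sketch below is offered under stated assumptions only: the quartiles 69.2 and 82.8 are those quoted in the text for the data of Table 52, while the list `items` is hypothetical, and the mean deviation is here measured from the arithmetic mean, which is one possible reading of the rule that a measure be divided by the average from which its deviations were taken.

```python
from statistics import mean, pstdev


def mean_deviation(items, center):
    """Average of the deviations from `center`, signs ignored."""
    return mean(abs(x - center) for x in items)


def quartile_coefficient(q1, q3):
    """Quartile deviation divided by the mid-point of the two quartiles."""
    return ((q3 - q1) / 2) / ((q3 + q1) / 2)


# Quartiles quoted in the text for the data of Table 52.
q1, q3 = 69.2, 82.8
quartile_deviation = (q3 - q1) / 2
print(round(quartile_deviation, 1), round(quartile_coefficient(q1, q3), 2))

# Hypothetical items, chosen only to make the arithmetic easy to follow.
items = [2, 4, 4, 4, 5, 5, 7, 9]
m_a = mean(items)
sigma = pstdev(items)                          # standard deviation
mean_coefficient = mean_deviation(items, m_a) / m_a
standard_coefficient = sigma / m_a
coefficient_of_variation = 100 * sigma / m_a   # Pearson's form
print(mean_coefficient, standard_coefficient, coefficient_of_variation)
```

The quartile figures agree with the quartile row of Table 53; the remaining figures describe the hypothetical items only.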


B. ASYMMETRY OR SKEWNESS 


In the preceding section, ways of measuring the amount of 
dispersion were considered. Dispersion may also be examined 
with reference to the extent to which items throw unevenly to 
one side or the other of the typical item in the distribution. If 
the items are similarly dispersed on both sides of the mode, the 
distribution is said to be symmetrical; otherwise, asymmetrical, 
or skewed. In the present section, the analysis of dispersion is 
with reference to its asymmetry or skewness. 

That this may be an important consideration in the analysis 
of a frequency distribution is obvious from the fact that two 
distributions may exhibit the same typical size of item and dis- 
close the same amount of dispersion and yet be entirely different 
in another important respect. The two graphs shown in Chart 37 
illustrate the point.

The modes of the two graphs coincide at $M_o$, and the standard
deviations of the two are equal. Yet the distributions differ 
fundamentally in that one tapers off to the lower side of the mode 
and the other to the upper side. Two other distributions might 
skew in the same direction but by different amounts. Differ- 
ences of this sort are to be disclosed through the measurement of 
asymmetry. Analysis along this line should constitute a part of 
any comprehensive comparison of two distributions of markedly 
different forms. 


CHART 37. FREQUENCY DISTRIBUTIONS OF IDENTICAL MODES AND 
VARIABILITY BUT UNLIKE ASYMMETRY 


A number of different measures of asymmetry have been 
employed. Only two, however, call for careful consideration 
in the present study. These are measures based upon (1) the 
difference between the mean and the mode (or median), and 
(2) the difference in the distances of the first and third quartiles 
from the median.¹ The differences in both cases are ordinarily
reduced to relative form by dividing through by some measure 
of the amount of dispersion. This would seem to be logically 
desirable owing to the fact that the possibility of asymmetry 


1 See King, Elements of Statistical Method, pp. 162-166, for other measures of dispersion.

is really dependent upon the amount of dispersion in the distribution.
Reduced to relative or coefficient form, the first of the
two measures of asymmetry is commonly called the Pearsonian, 
the second, the quartile, measure. The two may be briefly 
explained. 

The Pearsonian measure, as already noted, rests upon the fact 
that when any distribution is markedly asymmetric, the arith- 
metic mean and the mode part company. They separate more 
and more as the distribution becomes more and more asymmetric. 
Consequently, the difference between the arithmetic mean and 
the mode may be made to serve as a measure of asymmetry. 
The difference is divided by the standard deviation as a measure 
of the extent of dispersion.¹ The formula for the measure is thus

$$S_p = \frac{M_a - M_o}{\sigma}$$


Since the terms here are familiar, no new questions arise re- 
garding the computations. The nature of the measure may be 
illustrated by reference to the data already given in Table 32. 


Se om 

The other common measure of asymmetry, the quartile
measure, rests upon the fact that if the distribution is skewed,
$Q_3 - M_e$ differs from $M_e - Q_1$. If we take the difference between
these two, then, we have a measure of asymmetry. Putting
the expression in coefficient form, by dividing by the quartile
measure of deviation, we obtain for the quartile coefficient of
asymmetry the expression²

$$S_q = \frac{(Q_3 - M_e) - (M_e - Q_1)}{(Q_3 - Q_1)/2} = \frac{Q_3 + Q_1 - 2M_e}{(Q_3 - Q_1)/2}$$


The coefficient of asymmetry in this case is 


1 Instead of $\sigma$, some other measure of dispersion might be employed, say, the average
deviation from the mode. 

2 Though the quartile coefficient may assume values as high as + 2 and as low as - 2,
in practice it rarely goes beyond + 1 and - 1. If the distribution is symmetrical, the
expression clearly becomes 0. In the illustrative example shown for the Pearsonian coefficient,
the quartile coefficient is equal to $\frac{69.2 + 82.8 - 150.6}{6.8} = .21$.

Both the Pearsonian and quartile coefficients of asymmetry
carry a definite sign; that is, the value of the coefficient is either
definitely positive or definitely negative (provided, of course,
it is not 0). Asymmetry in a distribution must always be regarded
as either positive or negative. It follows from the form in which 
the measures have already been given that if the mode exceeds 
the arithmetic mean the coefficient is negative. This is equiva- 
lent to saying that when the distribution tails off toward the lower 
end of the scale, it is negatively skewed; when it tails off toward 
the upper end of the scale, it is positively skewed.¹
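
The two coefficients of asymmetry, and the rule of signs just stated, may be put in the form of a brief sketch. The function names are hypothetical; the quartile figures 69.2, 75.3, and 82.8 are those quoted in the text, and the remaining arguments are invented solely to exhibit the signs and the limits of the quartile coefficient.

```python
def pearsonian_coefficient(mean, mode, sigma):
    """Difference between mean and mode in units of the standard deviation."""
    return (mean - mode) / sigma


def quartile_coefficient_of_asymmetry(q1, median, q3):
    """Difference of the quartile distances from the median,
    divided by the quartile deviation."""
    quartile_deviation = (q3 - q1) / 2
    return ((q3 - median) - (median - q1)) / quartile_deviation


# Quartiles and median quoted in the text.
s_q = quartile_coefficient_of_asymmetry(69.2, 75.3, 82.8)
print(round(s_q, 2))

# Hypothetical values: the mode exceeds the mean, so the sign is negative.
negative = pearsonian_coefficient(mean=4.0, mode=6.0, sigma=2.0)

# The extreme cases arise when the median coincides with a quartile.
upper_limit = quartile_coefficient_of_asymmetry(10, 10, 20)
lower_limit = quartile_coefficient_of_asymmetry(10, 20, 20)
print(negative, upper_limit, lower_limit)
```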


TABLE 54. FREQUENCY DISTRIBUTION OF NUMBER IN LITTER
AMONG LITTERS OF MICE

NUMBER IN LITTER | NUMBER OF LITTERS
1 | 7
2 | 11
3 | 16
4 | 17
5 | 26
6 | 31
7 | 11
8 | 1
9 | 1

The relative merits of the two measures of asymmetry are not 
altogether clear. The Pearsonian is somewhat more sensitive 
to slight differences in moderately asymmetrical distributions. 
The difficulty with the Pearsonian measure, however, lies in the 
fact that the mode is at times ill-defined.² The median may be
substituted for the mode in the coefficient since, like the mode, it 
is an average of position not as much affected by the dispersion 
of the distribution as is the arithmetic mean. Unfortunately, 
however, the median is somewhat influenced by asymmetry and 
under conditions of moderate asymmetry often differs only 

1 In the chart at the beginning of this section, curve A is negatively asymmetric, curve
B positively asymmetric. In general, positive asymmetry is much more common than 
negative. An interesting biological illustration of negative skewness is given in Table 54. 


2 The method of determining the mode from the expression $M_a - 3(M_a - M_e)$ appears
to give results too uncertain for use in measuring asymmetry.

slightly from the arithmetic mean. The substitution of the
median for the mode consequently does not lead to satisfactory 
results. If the mode is not well defined, the Pearsonian measure 
of asymmetry can hardly be employed. 

The quartile coefficient of asymmetry has the advantage of 
being fairly definite in character and obtainable without serious 
difficulty. It suffers from the defect, however, of not being as 
delicate a register of moderate asymmetry. When the dis- 
tributions being compared are only moderately asymmetric and 
their modes reasonably distinct, the Pearsonian measure is to 
be preferred. If, however, either of these conditions is not pres- 
ent, the quartile coefficient of asymmetry is practically the only 
simple device available for the measurement of skewness. 


In variability and asymmetry the way in which items scatter 
around the typical size of the variable is significantly character- 
ized. Variability is the more important feature of the dispersion, 
but asymmetry at times constitutes a striking point of demarca- 
tion among variables otherwise resembling one another. The 
general picture of the variable is completely drawn only in a 
properly constructed frequency distribution, a wisely selected 
average, and appropriate measures of variability and asymmetry. 
Comparison of variables along these lines is the basis upon which 
rest many important statistical deductions. 


ANALYSIS OF PAIRED VARIABLES: 
CORRELATION 


CHAPTER XII 
THE MEANING OF CORRELATION 


Thus far we have been considering the analysis of individual
variables, and the comparison of significant differences dis- 
covered by this means. Simple comparison sometimes uncovers 
striking differences or contrasts, and is suggestive of funda- 
mental relationships among the variables. But complete demon- 
stration of the existence and precise nature of these relationships 
requires something more than the comparison of related vari- 
ables; a correlation analysis is necessary. Such an analysis 
depends upon definite conditions both in original observation 
and in subsequent analysis. Attention may first be directed 
toward the requisite conditions of original observation. 

In the simplest form of statistical returns, individual variables 
are reported without any attempt to link the observations of one 
variable with the observations of another. Thus, in two inde- 
pendent investigations, the ages and earnings of the workers in 
an industrial plant may be individually reported. The data of 
age and earnings may then be subjected to careful analysis and 
certain comparisons made. But the opportunities for effective 
analysis may be greatly enlarged if the observations of age and 
earnings are carefully linked together and the records made to 
show the age which is associated with given earnings in the case 
of each individual worker. 

Such a pairing of variables in individuals or objects is common. 
We may, for example, observe height and weight in a single 
individual; the ages of husband and wife in a single marriage;
the discount and dividend rates in a given bank; the volume of
business and the rate of net profit in a specified business; the
quantity offered for sale and the price at which sale is effected in 
a given transaction. Such pairing of variables is to be recog- 
nized as a specific condition of observation, differing fundamen- 
tally from that which has been assumed up to this point in the 
analysis of statistical material.¹

It is not to be thought that the pairing of variables must always 
be through their appearance in the same object. The pairing 
of variables may be through like location in space. Thus, 
observed rainfall and crop results may be said to be paired 
when they relate to the same farm areas. Similarly, variables 
may be paired through their relationship in time. Imports 
of silk may be paired with the domestic price of silk in the 
same period, or of a period preceding or following by some 
uniform span of time. The pairing of observations among vari- 
ables may, in short, be on three different bases: (1) objective; 
(2) spatial; and (3) temporal. In all three cases, the funda- 
mental fact is that observed magnitudes in one variable are 
linked in some definite and uniform fashion to observed magnitudes
in another variable.² Perhaps the simplest notion of pairing,
however, is to be had from the consideration of two variables
linked together objectively; that is, in the same object. Explanation
of the analysis of paired observations will be made, therefore,
with special reference to this simplest condition.³

As a first step in the analysis of paired observations, it is in- 
structive to consider what is commonly called a scatter diagram. 
Such a scatter diagram is readily constructed if a coordinate field
is laid out, with the scale of the X variable along the horizontal
axis and the scale of the y variable along the vertical axis, and 
every pair of observations plotted as a point so located on the 
grid as to represent the combination of magnitudes recorded in 
the observations. In Chart 38 a scatter diagram constructed 


1 In accord with the notation already made, the paired observations may be represented
by $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$.

2 It is not necessary to assume that the condition of pairing is confined to two variables.
Pairing may hold for three or more. 

3 Such differences in the technique of analysis as follow from the nature of the pairing
will be considered in later chapters.

on this plan is shown for the paired observations of two variables
given for forty-six states in 1922; viz. (1) the number of auto- 
mobiles registered per capita, (2) the total wealth per capita. 
The plot in the lower left-hand corner of the chart stands for a 
state in which the per capita number of cars was .038 and the 
per capita wealth, $1244. The other plots correspond to similar 
combinations of the two variables. The whole figure constitutes 
a simple scatter diagram. 


CHART 38. SCATTER DIAGRAM OF WEALTH PER CAPITA AND AUTOMOBILE
REGISTRATIONS PER CAPITA IN 46 STATES

[Chart: scatter diagram. Vertical axis: number of cars per capita. Horizontal axis: wealth per capita (in dollars), scaled from 1000 to 5000.]
Ordinarily, unless the number of paired observations be small, say 
less than 30, the correlation analysis begins with a summary tabulation
of the data in a two-way frequency, or correlation, table.¹

1 A two-way frequency table in which the classifications are qualitative is commonly
called a contingency table, in distinction from the correlation table in which the
classifications are quantitative.

In Table 55 the data represented in the scatter diagram just
shown are cast into such a correlation table. 


TABLE 55. CORRELATIVE DISTRIBUTION OF (A) PER CAPITA WEALTH, AND
(B) PER CAPITA REGISTRATION OF AUTOMOBILES, AMONG 46 STATES
IN 1922¹


NUMBER OF CARS PER CAPITA | WEALTH PER CAPITA (in dollars): 1000-1499 | 1500-1999 | 2000-2499 | 2500-2999 | 3000-3499 | 3500-3999 | 4000-4499 | 4500-4999 | All

.200-.224 1 1
.175-.199 1 2 3
.150-.174 2 2 2 1 1 8
.125-.149 1 2 3
.100-.124 1 3 4 5 4 17
.075-.099 3 3
.050-.074 3 3 2 8
.025-.049 3 3
All 6 4 5 7 13 6 4 1 46


Sources: Department of Commerce, Bureau of the Census, Estimated National Wealth, 
1922, pp. 28-29; National Automobile Chamber of Commerce, Facts and Figures, 1924, 
p. 38.


Examination of the table should make clear its general character.
It may be obtained from the scatter diagram by subdividing
the coordinate area into equal rectangular boxes or
compartments, and then counting and entering the number of 
cases that fall within each. Each row and each column of the 
table is a simple frequency distribution of the items falling within 
that section of the scatter diagram. The table as a whole serves 
to summarize the data and to bring the paired observations into 
a form convenient for subsequent analysis. Tables 56 and 57 
afford additional illustrations of the same type of tabulation. 
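
The counting of cases within compartments may be sketched as follows. The class limits are those of Table 55; the function names are hypothetical, and of the paired observations only the first (wealth of $1244 and .038 cars per capita) is taken from the text, the rest being invented for illustration.

```python
from collections import Counter


def class_index(value, start, width):
    """Number of whole class intervals lying between `start` and `value`."""
    return int((value - start) // width)


def correlation_table(pairs, x_start, x_width, y_start, y_width):
    """Count the pairs falling in each (row, column) compartment."""
    table = Counter()
    for x, y in pairs:
        row = class_index(y, y_start, y_width)
        col = class_index(x, x_start, x_width)
        table[(row, col)] += 1
    return table


# (wealth per capita, cars per capita); only the first pair is from the text.
pairs = [(1244, 0.038), (1800, 0.060), (2600, 0.110),
         (3100, 0.120), (3300, 0.160), (4200, 0.180)]

table = correlation_table(pairs, x_start=1000, x_width=500,
                          y_start=0.025, y_width=0.025)

# Each row and each column is itself a simple frequency distribution.
row_totals = Counter()
col_totals = Counter()
for (row, col), count in table.items():
    row_totals[row] += count
    col_totals[col] += count
print(sorted(table.items()))
```

Rows are numbered here from the lowest class of the vertical scale upward. Values lying exactly on a class limit would call for more careful handling than this floating-point division affords.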


1 California, with $4007 wealth, and .251 cars, per capita, and Nevada, with $6908 
wealth, and .157 cars, per capita, are omitted.

TABLE 56. CORRELATIVE DISTRIBUTION OF THE QUANTITY OF OUTPUT AND THE
SIZE OF LABOR FORCE IN COAL MINES DELIVERING MORE THAN 100,000
TONS DURING GIVEN YEAR


AVERAGE DAILY PRODUCTION (in tons) | NUMBER OF MEN EMPLOYED: 80 to 119 | 120 to 159 | 160 to 199 | 200 to 239 | 240 to 279 | 280 to 319 | 320 to 359 | 360 to 399 | 400 to 439 | 440 to 479 | 480 to 519 | 520 to 559 | 560 to 599 | 600 to 639 | 640 to 679 | 680 to 719
3400-3599 I 
3200~3399 i; TG ake 2 ee 
Pee otal a are bol pol nena 
‘2800-2909 | | | cod Halo aa 
2600-2799 a hal tak ae 
4oo-2500| | | | lence ee tian. 
2200-2399 ag ay I 2 as 
"2000-2100 © “al s] ala [a ot 
1800-1999 2 4 Bile a I nok 
1600-1799 se Sedh agi S35 Bl oe ae arpa 
I400~15909 Eis As) 83) 3 B ee ae at a ae, 
1200-1399 7| 4 aS |) aG||, ie * I as, 
IOOO-1199 7 | 10 % 5 I ake ey ive 
800-909 A A 12 me 2 a eet I Ce sian ra ie 
600-799 PRS Ae ae all x ae I a 
400-599 | 3 | 4] 5 2 ms lex = a 
Total Bevan TONrss 26] 17|x4[ 16/32) 5 s|4]s[2]4]o 


Source: Thirtieth Annual Coal Report of Illinois, 1911.


In general, the preparation of a correlation table is governed 
by the same rules as are observed in the formation of a simple 
frequency distribution.¹ Only one slight modification of the
procedure has to be recognized. It will be recalled that it was
stated in giving rules for the formulation of a frequency dis-


1 See Chapter VI.




TABLE 57. CORRELATIVE DISTRIBUTION OF AGE AND WEEKLY EARNINGS AMONG
CINCINNATI NEWSBOYS UNDER SIXTEEN YEARS OF AGE AND EARNING LESS
THAN THREE DOLLARS A WEEK


WEEKLY EARNINGS (in cents) | AGE (years completed): 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | Total
275—299 ° 
250-274 ie ae 2 ey 4 2 2 2) 12 
225-249 a ae oneal 4 2 4 Io 
200-224 Ty PPh es 4 6 2 8 2 22 
175-199 6 2 8 4 4 24 
150-174 id |) ee Io 16 16 a 18 6 80 
125-149 cai 2 4 6 6 4 eee 22 
100-124 ae 2 4 8 18 24 4 4 64 
75-99 oar 2 inane 20 m2 22 6 4 72 
50-74 ‘i 4 26 50 42 24 6 8 160 
25-49 eo 6 22 30 30 26 14 2 138 
Q-24 4 6 10 20 
Total 8 14 76 144 150 130 70 32 624


Source: M. B. Hexter, ‘‘ The Newsboys of Cincinnati,” Studies of the Helen S. Trouns- 
tine Foundation, vol. I, no. 4. 


tribution that a good working rule for the choice of the class 
interval was to adopt that interval which would result in 15-25 
equal classes for the complete distribution. This rule suggests 
a rather larger number of classes than is ordinarily desirable in 
the classification of either of the two variables embodied in a 
two-way table. Commonly, not more than eight to twelve 
classes are set up in grouping the observations of each of the two 
variables in such a tabulation. Aside from this one slight modi- 
fication, classification of paired variables follows the line indicated 
in Chapter VI.¹ Of course, in entering items and summing


1 For convenience in later analysis the scale of the X-variable may well be placed in two- 
way distributions at the top of the table instead of at the bottom.

classes in the two-way table, there has to be careful recognition
of the two different classifications which govern simultaneously
each entry.¹ The construction of the skeleton table and the
tabulation of items are consequently somewhat more complicated;
but neither introduces any new principle.²

Both the scatter diagram and the two-way frequency table 
serve to bring out any relationships that may exist between the 
paired variables. It is entirely possible that the distribution 
may indicate that apparently no connection exists between X 
and Y. In this case, the paired items will scatter through the 
diagram or table so as to cause the parts furthest away from the 
center to be but scantily occupied, and the parts nearer the center 
of the table to be more closely filled. In idealized form, this case 
would exhibit the distribution of frequencies shown in Table 58. 


TABLE 58. CORRELATIVE DISTRIBUTION OF PAIRED VALUES OF X AND Y

(Idealized)

VALUES OF Y | VALUES OF X: 1 | 2 | 3 | 4 | 5 | 6 | 7 | All
35 | 1 | 2 | 3 | 4 | 3 | 2 | 1 | 16
30 | 2 | 4 | 6 | 8 | 6 | 4 | 2 | 32
25 | 3 | 6 | 9 | 12 | 9 | 6 | 3 | 48
20 | 4 | 8 | 12 | 16 | 12 | 8 | 4 | 64
15 | 3 | 6 | 9 | 12 | 9 | 6 | 3 | 48
10 | 2 | 4 | 6 | 8 | 6 | 4 | 2 | 32
5 | 1 | 2 | 3 | 4 | 3 | 2 | 1 | 16
All | 16 | 32 | 48 | 64 | 48 | 32 | 16 | 256


1 As already suggested, the two variables which are entered simultaneously in the two-way 
frequency distribution are commonly referred to as the X and Y variables, the X variable 
being the one having its scale horizontally placed, the Y variable, the one having its scale 
vertically placed. 

2 It has been common practice in arranging the two-way table to allow the scale of the 
variable horizontally placed to run from left to right, and the scale of the variable vertically 
placed, from top to bottom. Despite the extent to which this practice has been followed, 
it seems preferable from many points of view, and much more consistent with statistical
practice in other particulars, to arrange the scale of the variable vertically placed so that
it runs from bottom to top. This arrangement will be adopted in this and the following 
chapters. The advantages of the arrangement will appear as the exposition proceeds.

Here the frequencies increase regularly as the center of the table
is approached, from whatever direction the approach is made. 
The table is essentially symmetrical in any cross-section that 
may be made of it, just as it is symmetrical in the two summary 
distributions of X and Y which appear in the extreme right-hand
column and bottom row of the table. 
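
The idealized table may be recovered from its two summary distributions alone, on the supposition that no connection exists between X and Y, each compartment then receiving the product of its row and column totals divided by the whole number of cases. The following sketch rests on that supposition; the function name is hypothetical, and the totals are those of the "All" row and column of Table 58.

```python
def no_correlation_table(row_totals, col_totals):
    """Compartment frequencies when row and column classes are unconnected."""
    n = sum(row_totals)
    if n != sum(col_totals):
        raise ValueError("row and column totals must count the same cases")
    return [[r * c / n for c in col_totals] for r in row_totals]


# Summary distributions of Table 58 (Y = 35 at the top, X = 1 at the left).
totals = [16, 32, 48, 64, 48, 32, 16]
table = no_correlation_table(totals, totals)

for row in table:
    print([int(f) for f in row])
```

Every row is proportional to every other, which is why the table shows the same symmetrical form in any cross-section.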

In contrast to this situation, the distribution exhibited by a 
scatter diagram and two-way frequency table may disclose a 
tendency for the items to mass in a band stretching across the 
field. Consider the two-way distribution shown in Table 59. 


TABLE 59. CORRELATIVE DISTRIBUTION OF PAIRED VALUES OF X AND Y


(Idealized) 
VALUES VALUES OF X 
OF 
ed eal ha Ri Dg OE 
35) . 
3° 9 r 
25 17 : a 
20 ae = 
15 17 * 
sie) 9 cl ‘ 
5 I ; 
Ti. er We ee een ye 


The frequencies occur in this table in a set of compartments which 
run diagonally through the table from the lower left-hand corner 
to the upper right-hand corner. 

Let us consider just what is the meaning of this concentration 
of the frequencies. Even a cursory consideration of the situation 
suggests that the compression of frequencies into a band extend- 
ing diagonally through the table in the fashion observed reflects 
a condition in which relatively low values of one of the paired 
variables tend to be associated with relatively low values of the 
other, high values of the one with high values of the other. If
the band were to run from the upper left-hand corner to the lower
right, the inverse relationship would hold; namely, relatively 
high values of the one variable would be paired with relatively 
low values of the other, relatively low values of the one variable 
with relatively high values of the other. In other words, the 
table discloses what is apparently a persistent relationship 
between the magnitudes of the paired variables. 

This relationship may be shown in another way. Let the 
means of the successive columns, or of the successive rows, be 
computed. The means of the columns of Table 57 are shown 
below, numerically in Table 60 and graphically in Chart 39.¹


AVERAGE WEEKLY EARNINGS OF 624 NEWSBOYS OF CINCINNATI, BY AGE

TABLE 60

AGE (nearest year) | AVERAGE EARNINGS
8 | $ .37
9 | .62
10 | [illegible]
11 | [illegible]
12 | .93
13 | 1.01
14 | [illegible]
15 | 1.26

CHART 39

[Chart: average earnings, on a vertical scale from 0 to $1.50, plotted against age (nearest year), from 8 to 15.]


The means of the rows might be made the basis of a similar plot. 
Both table and chart show clearly that on the average, high values 
of one of the variables are associated with high values of the 
other ; low values of the one with low values of the other. There 
is thus obtained further evidence of the existence of a persistent 
relationship between the magnitudes of the two variables. 
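
The computation of the means of successive columns may be sketched as below. The function name and the small banded table are hypothetical; the second table reproduces the idealized frequencies of Table 58, for which no relationship should appear. The sketch assumes that every column contains at least one case.

```python
def column_means(table, y_values):
    """Mean Y of each column; the rows of `table` correspond to `y_values`."""
    means = []
    for j in range(len(table[0])):
        column = [row[j] for row in table]
        total = sum(column)
        weighted = sum(f * y for f, y in zip(column, y_values))
        means.append(weighted / total)
    return means


# Hypothetical banded table, Y = 30 at the top and X increasing to the right.
banded = [[0, 1, 4],
          [1, 4, 1],
          [4, 1, 0]]
rising = column_means(banded, [30, 20, 10])

# Idealized frequencies of Table 58, Y = 35 at the top.
table_58 = [[1, 2, 3, 4, 3, 2, 1],
            [2, 4, 6, 8, 6, 4, 2],
            [3, 6, 9, 12, 9, 6, 3],
            [4, 8, 12, 16, 12, 8, 4],
            [3, 6, 9, 12, 9, 6, 3],
            [2, 4, 6, 8, 6, 4, 2],
            [1, 2, 3, 4, 3, 2, 1]]
flat = column_means(table_58, [35, 30, 25, 20, 15, 10, 5])
print(rising, flat)
```

In the banded table the column means rise with X; in the idealized table they are the same for every column.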

The existence of such a persistent relationship between paired 
variables is the essential feature of correlation. By correlation 
is meant, in brief, a definite tendency for two or more variables 
to vary together. The variables may move in the same or in 


1 This tabulation, showing as it does a relationship between two variables, obviously is a
series. It may be appropriately designated a correlative series.

opposite directions, but if they are correlated they are never indifferent
to one another; they are either mutually attractive
or mutually repellent. Correlation involves a one-to-one corre- 
spondence between the paired variables. Bowley gives a clear 
statement of the condition as follows: 


“When two quantities are so related that the fluctuations in one are in 
sympathy with the fluctuations of the other, so that an increase or de- 
crease of one is found in connection with an increase or decrease (or in- 
versely) of the other, and the greater the magnitude of the changes in the 
one, the greater the magnitude of the changes in the other, the quantities 
are held to be correlated.”¹


Correlation may be thought of, then, as a ‘“‘tendency towards 
concomitant variation.” The tendency may be conspicuous, or 
barely noticeable, but the one-to-one correspondence of the 
variables must be greater than can be reasonably charged to 
mere chance. 

It must not be supposed that the existence of correlation 
between any two variables proves any simple and direct causal 
connection between the two. The two may be associated as 
kindred effects of some single third factor; or they may display 
like movements because affected by similar though distinct 
underlying influences; or they may show one-to-one correspond- 
ence because within the variables as reported may be unrecorded 
elements which are causally connected. Measurement of corre- 
lation is one thing; interpretation of correlation, quite another. 
In fact, the explanation of observed correlations is one of the 
most difficult tasks in the whole field of statistical analysis.²

The concept of correlation just given really has two distinct 
phases. In the first place, it carries the idea of varying degrees 
of conformity to some clearly defined (functional) relationship 
between the paired variables; and secondly, it involves the idea 
of this definite functional relationship to which the observations 
of the paired variables tend in some measure to conform. Corre- 
sponding to these two phases of the concept, there are two distinct 
lines of analysis. The first is designed to determine the extent 


1 Elements of Statistics, p. 316.
2 More will be said of this in Chapter XXIV.

to which the observed values of the variables actually conform
to some functional relationship between the two — this is the 
analysis of the extent of correlation. The second is directed 
toward the precise description of the functional relationship 
which best fits, or conforms to, the observed values of the two 
variables — this is the analysis of the form of the correlation. 
These two problems, as we shall see, are very closely related. 
In a measure they are solved by the same line of analysis. This 
analysis will be considered in the following chapter. 


CHAPTER XIII 


THE MEASUREMENT OF CORRELATION 


It has been suggested that the correlation problem falls into
two parts; that having to do with the measurement of the extent 
of correlation, and that having to do with the description of the 
relationship which the measurement of correlation implies. It 
is the problem of measuring the extent or degree of correlation 
which is to be entertained first. 

In the discussion of the preceding chapter on the meaning 
of correlation, use was made of such phrases as ‘‘a high value of 
one variable associated with a high value of the other” and “a 
low value of one variable linked with a low value of the other.” 
Such phrases assume a number of things not altogether clear 
on the surface. It will be helpful to consider at the beginning 
the fundamental question of what is meant by “‘a high value of 
the one variable associated with a high value of the other.” 

It is obvious in the first place that the notion of high and low 
values of any variable must be relative to the average or typical 
size of that particular variable. What constitutes extreme height 
for men of one race might not be extreme height at all for men of a 
different race. What is a small harvest for one crop might be a 
large harvest for another. What is a low price for one commodity 
might be a high price for another. Height, size of crop, price, 
are all to be regarded relatively. 

If this is true, the reduction of original items to items having 
the form of deviations above and below the mean size of the 
variable is indicated. Thus, instead of thinking of an individual 
as five feet ten and one-half inches tall, we are to think of him as 
four inches above the average in height ; instead of thinking of him 
as weighing one hundred and forty-five pounds, we are to think 





of him as fifteen pounds under the average in weight. These 
deviation items bring us closer to the notion of “relatively high” 
and “relatively low” values in different variables.’ 

A still further step is necessary before we arrive definitely at the 
concept of equal relative values of the variables. For, after all, 
it is a question of whether four inches above the average of height 
is an extreme departure from the average, or only a moderate 
one; whether fifteen pounds deviation below the average of 
weight is a relatively slight deviation, or an extreme one. In 
other words, we need to apply standards of variability. 

The question of how large or how small a deviation is, can be 
answered only in terms of the typical amount of deviation. 
This takes us back, of course, to the general subject of measures 
of dispersion, in connection with which it was concluded that the 
standard deviation is for most purposes the most satisfactory 
measure of variability. Applying the measure in this case, we 
may say that if the standard deviation of height among the 
particular individuals included in the study is two inches, a 
deviation of four inches is twice the standard deviation. If the 
standard deviation of weight is ten pounds, a deviation of fifteen 
pounds is a deviation of one and one-half times the standard 
deviation. Stated in terms of their respective standard devia- 
tions, the deviation items for height and weight become directly 
comparable.²
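The reduction described above can be sketched in a few lines. This is only an illustration: the function name `reduced_items` and the list of heights are hypothetical, and the standard deviation is taken in the population form (sum of squared deviations divided by N), which is the form the text appears to use.

```python
from statistics import mean, pstdev

def reduced_items(values):
    """Express each value as a deviation from the mean, in standard-deviation units."""
    m = mean(values)
    s = pstdev(values)  # population form: sqrt(sum of squared deviations / N)
    return [(v - m) / s for v in values]

# The two cases worked in the text: deviation divided by standard deviation.
height_reduced = 4 / 2      # 4 inches above average, standard deviation 2 inches
weight_reduced = -15 / 10   # 15 pounds below average, standard deviation 10 pounds

# Hypothetical heights in inches, used only to exercise the function.
heights = [62.5, 64.5, 66.5, 68.5, 70.5]
z = reduced_items(heights)

print(height_reduced, weight_reduced)
print([round(v, 3) for v in z])
```

Whatever the original unit, the reduced items have mean zero and standard deviation one, which is what makes heights and weights directly comparable.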

An index of correlation must concern itself with relative
magnitudes in the paired variables. These relative magnitudes
must be in strictly comparable form. We may expect a measure
of correlation, then, to deal in “reduced items”; that is, in terms
of the deviations of the paired variables from their respective
means, these deviations being reduced to units of the standard
deviations of the variables.

1 The form of the deviation items has already been noted. It is $x_1, x_2, \ldots, x_n$ for the X
series, and $y_1, y_2, \ldots, y_n$ for the Y series. It will be recalled that $x_1 = X_1 - M_x$,
$x_2 = X_2 - M_x$, etc. It is to be noted that the deviations here are absolute in form and are in the
same unit as the original items.

2 The form of these relative deviation items is:

$$\frac{x_1}{\sigma_x}, \frac{x_2}{\sigma_x}, \ldots, \frac{x_n}{\sigma_x} \text{ for the X series, and } \frac{y_1}{\sigma_y}, \frac{y_2}{\sigma_y}, \ldots, \frac{y_n}{\sigma_y} \text{ for the Y series.}$$

These items have been referred to as “reduced” items. (See E. V. Huntington, “Mathematics
and Statistics, with an Elementary Account of the Correlation Coefficient and the
Correlation Ratio,” American Mathematical Monthly, vol. XXVI, Dec. 1919.) They are
obviously pure numbers and directly comparable in whatever units the two variables may
have been originally stated.


A. THE PEARSONIAN COEFFICIENT


The simplest and most widely used measure of correlation is
the Pearsonian coefficient.¹ This is applicable whenever the
relationship between the paired variables is linear; that is, in
the form of a straight line. The coefficient is commonly given in
the form

$$r = \frac{\Sigma xy}{N \sigma_x \sigma_y}$$

This is equivalent to

$$r = \frac{1}{N}\,\Sigma\left(\frac{x}{\sigma_x}\cdot\frac{y}{\sigma_y}\right)$$

Given a simple linear relationship between X and Y, a relationship
to be expressed in the form of y = mx + b, the equation of
the straight line, the mean product of the paired items in
reduced form is a simple and unambiguous measure of the extent
of correlation.²
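The "mean product of the paired items in reduced form" can be written out directly. The sketch below is illustrative only; the function name `pearson_r` and the small data sets are assumptions, not taken from the text.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Mean product of paired items, each reduced to standard-deviation units."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dev_x = [x - mean_x for x in xs]          # deviations x = X - M_x
    dev_y = [y - mean_y for y in ys]          # deviations y = Y - M_y
    sigma_x = sqrt(sum(d * d for d in dev_x) / n)
    sigma_y = sqrt(sum(d * d for d in dev_y) / n)
    return sum((dx / sigma_x) * (dy / sigma_y)
               for dx, dy in zip(dev_x, dev_y)) / n

# Hypothetical series lying exactly on straight lines.
xs = [1, 2, 3, 4, 5]
r_direct = pearson_r(xs, [2 * x + 3 for x in xs])   # rising line
r_inverse = pearson_r(xs, [10 - x for x in xs])     # falling line
print(round(r_direct, 6), round(r_inverse, 6))
```

A rising straight line gives a coefficient of +1 and a falling one gives −1, whatever the slope or intercept.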

1 Originally suggested by the eminent English biometrician, Karl Pearson.

2 Since $\sigma_x = \sqrt{\Sigma x^2 / N}$ and $\sigma_y = \sqrt{\Sigma y^2 / N}$, the coefficient also equals

$$r = \frac{\Sigma xy}{\sqrt{\Sigma x^2}\,\sqrt{\Sigma y^2}}$$

Certain characteristics of the Pearsonian coefficient may be
readily deduced. Suppose, for instance, that each reduced value
of the X variable is paired with a reduced Y item of exactly the
same value and sign. This, obviously, is a condition of perfect
correspondence or correlation between X and Y. What will be
the value of r, the correlation coefficient, under these conditions?
If the reduced values of X and Y are always equal, $x/\sigma_x$ may be
substituted for $y/\sigma_y$ in the coefficient. Let this substitution be
made and the expression simplified. The value of r becomes
+1. Just as clearly it is −1 if the reduced values of X and Y are
equal but of opposite sign. Values of r equal to +1 and −1
therefore represent perfect correlation; +1 perfect direct correlation,
and −1 perfect inverse correlation.
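The substitution just described may be written out as follows. This is a sketch of the algebra only; it assumes the standard deviation is defined as $\sigma_x = \sqrt{\Sigma x^2 / N}$, the form used elsewhere in the chapter.

```latex
% Case 1: reduced values equal, y / sigma_y = x / sigma_x
r = \frac{1}{N}\,\Sigma\left(\frac{x}{\sigma_x}\cdot\frac{y}{\sigma_y}\right)
  = \frac{1}{N}\,\Sigma\left(\frac{x}{\sigma_x}\right)^{2}
  = \frac{\Sigma x^{2}/N}{\sigma_x^{2}}
  = \frac{\sigma_x^{2}}{\sigma_x^{2}}
  = +1

% Case 2: reduced values equal but of opposite sign, y / sigma_y = -x / sigma_x
r = -\frac{1}{N}\,\Sigma\left(\frac{x}{\sigma_x}\right)^{2} = -1
```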

Suppose, on the other hand, that there is absolutely no persistent
relationship between the two variables. In this case, any
given value of X may be associated with a relatively large, or with
a relatively small, or with a medium, value of Y. The algebraic
sum of all the xy's (Σxy) then tends to become zero, for the negative
and positive values of xy are likely to be approximately
equal. It is clear, therefore, that r tends to become zero under
the condition of complete independence between the paired
variables.

It should be noted further that r has a maximum value when
large values of the reduced items of one variable are associated
with equally large reduced values of the other variable. This
follows from the elementary mathematical proposition that the
product of any two numbers whose sum is a constant is a maximum
when the two numbers are equal. Thus, the maximum
product of two numbers whose sum is 10 is 25; that is, 5 × 5.
The maximum value of r is given when X and Y in reduced form
are equal; the value of r becomes smaller and smaller as the
reduced values of X and Y depart more and more from equality.
But the condition under which the reduced values of X and Y are
equal is the condition of perfect correlation. The value of +1
is therefore a maximum value of the coefficient. The coefficient
can only assume values between +1 and −1, these extreme
values registering perfect correlation (direct and inverse respectively)
between the paired variables.

The actual calculation of a correlation coefficient follows lines
not unlike those previously noted in the discussion of other
statistical coefficients. The steps to be taken may be considered
first in a simple case in which no multiple frequencies
occur. The data of Table 61, columns (a) and (b), will serve
for this purpose.
TABLE 61. CALCULATION OF CORRELATION COEFFICIENT FROM SIMPLE
SERIES: LONG METHOD

(Data on Effect of Fertilizer on Yield of Orange Groves)

X = amount of fertilizer applied (in lbs. per acre)
Y = yield of fruit (in thousands of lbs. per acre)
x = deviation of X from M_x;  y = deviation of Y from M_y

FIELD |  X  |  Y  |    x    |   y   |   x²   |  y²   |   xy
      | (a) | (b) |   (c)   |  (d)  |  (e)   |  (f)  |   (g)
A     |  60 |  10 | −144.3  | −5.4  | 20,736 | 29.16 |  779.2
B     | 100 |  14 | −104.3  | −1.4  | 10,816 |  1.96 |  146.0
C     | 150 |  12 |  −54.3  | −3.4  |  2,949 | 11.56 |  184.6
D     | 220 |  16 |   15.7  |   .6  |    247 |   .36 |    9.4
E     | 240 |  19 |   35.7  |  3.6  |  1,267 | 12.96 |  128.5
F     | 310 |  17 |  105.7  |  1.6  | 11,236 |  2.56 |  169.1
G     | 350 |  20 |  145.7  |  4.6  | 21,316 | 21.16 |  670.2
Total |     |     |         |       | 68,567 | 79.72 | 2087.0

$$r = \frac{\Sigma(xy)}{N\sigma_x\sigma_y} = \frac{\Sigma(xy)}{\sqrt{\Sigma x^2}\,\sqrt{\Sigma y^2}} = \frac{2087}{\sqrt{68{,}567}\,\sqrt{79.72}} = \frac{2087}{261.85 \times 8.94} = \frac{2087}{2341} = .89$$


The necessary computations are shown in columns (c) to (g) of
the table. Values have to be found in the first place for the means
of X and Y ($M_x$ and $M_y$). The deviations of the original items
from their means may then be obtained. Finally, the squares
and cross-products of these deviations have to be calculated.
The method, working as it does from the actual means, is straightforward,
but in most cases needlessly cumbersome and tedious.
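The long method can be traced step by step in code. The sketch below uses the fertilizer and yield figures of Table 61; the variable names are illustrative. Because it carries the means unrounded, its column totals differ slightly from those of a hand computation that rounds each deviation to one decimal.

```python
from math import sqrt

fertilizer = [60, 100, 150, 220, 240, 310, 350]   # X, lbs. per acre
fruit_yield = [10, 14, 12, 16, 19, 17, 20]        # Y, thousands of lbs. per acre

n = len(fertilizer)
mean_x = sum(fertilizer) / n                      # M_x
mean_y = sum(fruit_yield) / n                     # M_y

x = [v - mean_x for v in fertilizer]              # column (c): deviations of X
y = [v - mean_y for v in fruit_yield]             # column (d): deviations of Y

sum_x2 = sum(d * d for d in x)                    # total of column (e)
sum_y2 = sum(d * d for d in y)                    # total of column (f)
sum_xy = sum(a * b for a, b in zip(x, y))         # total of column (g)

r = sum_xy / (sqrt(sum_x2) * sqrt(sum_y2))

print(round(mean_x, 1), round(mean_y, 1))
print(round(sum_x2), round(sum_y2, 2), round(sum_xy, 1))
print(round(r, 2))
```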
The computation of the coefficient may be considerably simplified
by working from arbitrary origins. Suppose that deviations
of the X variable from the arbitrary origin, A, are given by
$x'_1, x'_2, \ldots, x'_n$, and that paired deviations of the Y variable
from the arbitrary origin, B, are given by $y'_1, y'_2, \ldots, y'_n$. The
coefficient of correlation in terms of deviations from the arbitrary
origins is then to be obtained from the expression:

$$r = \frac{\dfrac{\Sigma(x'y')}{N} - c_x c_y}{\sigma_x \sigma_y}$$

in which $c_x = \Sigma x'/N$ and $c_y = \Sigma y'/N$ are the corrections for
the two arbitrary origins. With this formula it is possible to work
largely with whole numbers, a change which greatly facilitates
computation.
But it is possible to go further in the use of the arbitrary
origins. In Chapter XI, expressions in terms of deviations from
an arbitrary origin were obtained for $\sigma_x$ and $\sigma_y$. These may be
substituted in the formula for r. The formula then becomes:

$$r = \frac{\dfrac{\Sigma(x'y')}{N} - c_x c_y}{\sqrt{\dfrac{\Sigma(x')^2}{N} - c_x^2}\;\sqrt{\dfrac{\Sigma(y')^2}{N} - c_y^2}}$$

Or, multiplying both numerator and denominator by N,

$$r = \frac{\Sigma(x'y') - N c_x c_y}{\sqrt{\Sigma(x')^2 - N c_x^2}\;\sqrt{\Sigma(y')^2 - N c_y^2}}$$

Computation on the basis of these formulae proves highly convenient,
particularly if carried out entirely in intervals instead of
in original units.¹

Another formula to which attention should be called has been
devised by L. P. Ayres.² This formula runs entirely in terms of
the original items. It is as follows:

$$r = \frac{N\,\Sigma XY - \Sigma X\,\Sigma Y}{\sqrt{N\,\Sigma X^2 - (\Sigma X)^2}\;\sqrt{N\,\Sigma Y^2 - (\Sigma Y)^2}}$$

Here the complication of deviations with their unlike signs is
entirely avoided. No averages or arbitrary origins have to be
considered. The computation deals with larger numbers, of
course, but these are easily handled with the aid of computing
machines and tables of squares. In general, the formula, for
computing purposes, has distinct advantages.¹

1 For other closely related formulae, see the Handbook of Mathematical Statistics, p. 122.
2 See Journal of Educational Research, vol. I, March-June, 1920.
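A computation that runs entirely in the original items can be sketched as below. The arrangement shown is one common raw-score form of the coefficient; whether it matches the printed formula term for term is an assumption. The function name `raw_score_r` is hypothetical, and the data are the fertilizer and yield figures of Table 61.

```python
from math import sqrt

def raw_score_r(xs, ys):
    """Correlation coefficient from sums of the original items only.

    No means, deviations, or arbitrary origins are formed.
    """
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(a * b for a, b in zip(xs, ys))
    sum_x2 = sum(a * a for a in xs)
    sum_y2 = sum(b * b for b in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator

fertilizer = [60, 100, 150, 220, 240, 310, 350]
fruit_yield = [10, 14, 12, 16, 19, 17, 20]

r_raw = raw_score_r(fertilizer, fruit_yield)
print(round(r_raw, 2))
```

All intermediate quantities here are whole numbers when the data are whole numbers, which is the practical attraction of the arrangement.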

The extent to which the computation of r is shortened by the
use of such formulae as have just been given may be illustrated
by taking again the data of Table 61. These data with the full
details of the calculation of r by the shorter method appear in
Table 62.


TABLE 62. CALCULATION OF CORRELATION COEFFICIENT FROM SIMPLE
SERIES: SHORTER METHOD

(Data on Effect of Fertilizer on Yield of Orange Groves)

X = amount of fertilizer applied (in lbs. per acre)
Y = yield of fruit (in thousands of lbs. per acre)
x′ = deviation of X from arbitrary origin A (= 200)
y′ = deviation of Y from arbitrary origin B (= 15)

FIELD |  X  |  Y  |  x′   | y′  | (x′)² | (y′)² | x′y′
      | (a) | (b) |  (c)  | (d) |  (e)  |  (f)  | (g)
A     |  60 |  10 | −140  | −5  | 19600 |  25   |  700
B     | 100 |  14 | −100  | −1  | 10000 |   1   |  100
C     | 150 |  12 |  −50  | −3  |  2500 |   9   |  150
D     | 220 |  16 |   20  |  1  |   400 |   1   |   20
E     | 240 |  19 |   40  |  4  |  1600 |  16   |  160
F     | 310 |  17 |  110  |  2  | 12100 |   4   |  220
G     | 350 |  20 |  150  |  5  | 22500 |  25   |  750
Total |     |     |   30  |  3  | 68700 |  81   | 2100

c_x = 30/7 = 4.3          c_y = 3/7 = .4
Σ(x′)² = 68700            Σ(y′)² = 81
Σ(x′y′) = 2100

$$r = \frac{\Sigma(x'y') - N c_x c_y}{\sqrt{\Sigma(x')^2 - N c_x^2}\;\sqrt{\Sigma(y')^2 - N c_y^2}} = \frac{2100 - 7(4.3)(.4)}{\sqrt{68700 - 7(4.3)^2}\;\sqrt{81 - 7(.4)^2}} = \frac{2088}{2340} = .89$$


The comparative ease with which r is here obtained is in marked
contrast with the previous computation.
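The shorter method lends itself to a brief sketch. The arbitrary origins A = 200 and B = 15 are those of Table 62; the variable names are illustrative. The corrections c_x and c_y are carried unrounded here, whereas a hand computation would round them.

```python
from math import sqrt

fertilizer = [60, 100, 150, 220, 240, 310, 350]
fruit_yield = [10, 14, 12, 16, 19, 17, 20]

A, B = 200, 15                                   # arbitrary origins
xp = [v - A for v in fertilizer]                 # x' deviations
yp = [v - B for v in fruit_yield]                # y' deviations

n = len(xp)
c_x = sum(xp) / n                                # correction for origin A
c_y = sum(yp) / n                                # correction for origin B

sum_xp2 = sum(d * d for d in xp)
sum_yp2 = sum(d * d for d in yp)
sum_xpyp = sum(a * b for a, b in zip(xp, yp))

r_short = (sum_xpyp - n * c_x * c_y) / (
    sqrt(sum_xp2 - n * c_x ** 2) * sqrt(sum_yp2 - n * c_y ** 2)
)

print(sum(xp), sum(yp), sum_xp2, sum_yp2, sum_xpyp)
print(round(r_short, 2))
```

The choice of origins affects only the size of the numbers handled; any other whole-number origins would lead to the same coefficient.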

Calculation of the correlation coefficient from a correlation
table rests upon the same computation formulae as have been
indicated for more simple series. The factor of multiple frequencies,
however, complicates considerably the process of
computation. Certain computation forms may be used to
advantage. One of the most satisfactory of these is given in
Table 63.

1 With regard to these advantages, Dr. Ayres has made the following statement: “Experience
in using this formula in dealing with the problems arising in the regular work of a
statistical office indicates that it reduces by something more than one-half the time necessary
to compute correlations from short series of 20 or 30 pairs of items. On longer series the
saving is greater and on distribution tables involving some thousands of cases, grouped in
about 100 compartments, it cuts the time to one-third or one-fourth of that required by the
old methods. Almost more important is the fact that its use very greatly reduces the
number of errors made in working out the computations.” Ibid., pp. 220-221.


TABLE 63. CALCULATION OF CORRELATION COEFFICIENT FROM CORRELATION
TABLE: SHORT-CUT METHOD *

(Monthly Price and Production of Pig Iron)

Columns: classes of the monthly price index (x′ = deviation from arbitrary origin, in intervals).
Rows: classes of monthly pig-iron tonnage, in millions of tons (y′ = deviation from arbitrary origin, in intervals).

y′        | 7.50 to 8.00 | 8.00 to 8.50 | 8.50 to 9.00 | 9.00 to 9.50 | 9.50 to 10.00 | f_y | f_y y′ | f_y (y′)² | S_x′ | S_x′ y′
3         |              |              |              |      4       |       4       |   8 |   24   |    72     |  12  |   36
2         |              |              |      4       |     16       |       1       |  21 |   42   |    84     |  18  |   36
1         |              |      9       |     26       |      7       |       5       |  47 |   47   |    47     |   8  |    8
0         |      8       |     21       |     14       |      3       |       1       |  47 |    0   |     0     | −32  |    0
−1        |     10       |      8       |      2       |      2       |       1       |  23 |  −23   |    23     | −24  |   24
−2        |      7       |      3       |              |              |               |  10 |  −20   |    40     | −17  |   34
f_x       |     25       |     41       |     46       |     32       |      12       | 156 |   70   |   266     |      |  138
x′        |     −2       |     −1       |      0       |      1       |       2       |
f_x x′    |    −50       |    −41       |      0       |     32       |      24       | −35
f_x (x′)² |    100       |     41       |      0       |     32       |      48       | 221
S_y′      |    −24       |     −5       |     32       |     49       |      18       |
S_y′ x′   |     48       |      5       |      0       |     49       |      36       | 138

c_y = 70/156 = .4487 (intervals)
σ_y² = 266/156 − c_y² = 1.5038
σ_y = 1.2263 (intervals)
c_x = −35/156 = −.2244 (intervals)
σ_x² = 221/156 − c_x² = 1.3663
σ_x = 1.1689 (intervals)
Σx′y′/N = 138/156 = .8846 (intervals)

$$r = \frac{\dfrac{\Sigma x'y'}{N} - c_x c_y}{\sigma_x \sigma_y} = .687$$

Source: A. R. Crathorne, “Calculation of the Correlation Ratio,” Journal of American
Statistical Association, Sept. 1922, pp. 394-396.

* The entire calculation may be carried through in intervals.
The only column (or row) here which calls for comment is that
headed $S_{x'}$ (or $S_{y'}$). Each item in this column is the algebraic sum of the
deviations of the items in the row (or column) for which the
figure is given, the algebraic sum being expressed in units of
the interval. Thus, the item of the top row in this section of the
table (the item 12) is obtained as the sum of 4 times 1 and 4 times
2, 1 and 2 being respectively the deviations of the two compartments
in which frequencies appear in this row of the correlation
table. The fourth item in this same column (the item −32)
is obtained as the algebraic sum of 8 × (−2), 21 × (−1),
14 × 0, 3 × (+1), and 1 × (+2).

The form of the computation as a whole has the advantage of
giving two independent calculations of the algebraic sum of the
cross-products of x′ and y′. In general, the form is one of the
most convenient for obtaining the value of r from a correlation
table.
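The working of such a computation form can be sketched in code. The cell frequencies below are a reading of the pig-iron correlation table and should be checked against it; the variable names are illustrative. The sketch forms the row sums and column sums of deviations and confirms that the two routes to the sum of cross-products agree.

```python
from math import sqrt

x_steps = [-2, -1, 0, 1, 2]          # price-index classes, in intervals from origin
y_steps = [3, 2, 1, 0, -1, -2]       # tonnage classes, in intervals from origin

# Cell frequencies as read from the table (rows follow y_steps, columns x_steps).
freq = [
    [0, 0, 0, 4, 4],
    [0, 0, 4, 16, 1],
    [0, 9, 26, 7, 5],
    [8, 21, 14, 3, 1],
    [10, 8, 2, 2, 1],
    [7, 3, 0, 0, 0],
]

n = sum(sum(row) for row in freq)
f_y = [sum(row) for row in freq]
f_x = [sum(row[j] for row in freq) for j in range(len(x_steps))]

sum_fy = sum(f * y for f, y in zip(f_y, y_steps))
sum_fy2 = sum(f * y * y for f, y in zip(f_y, y_steps))
sum_fx = sum(f * x for f, x in zip(f_x, x_steps))
sum_fx2 = sum(f * x * x for f, x in zip(f_x, x_steps))

# First route: sum of x' deviations in each row, multiplied by the row's y'.
s_x = [sum(f * x for f, x in zip(row, x_steps)) for row in freq]
cross_by_rows = sum(s * y for s, y in zip(s_x, y_steps))

# Second route: sum of y' deviations in each column, multiplied by the column's x'.
s_y = [sum(freq[i][j] * y_steps[i] for i in range(len(y_steps)))
       for j in range(len(x_steps))]
cross_by_cols = sum(s * x for s, x in zip(s_y, x_steps))

c_x = sum_fx / n
c_y = sum_fy / n
sigma_x = sqrt(sum_fx2 / n - c_x ** 2)
sigma_y = sqrt(sum_fy2 / n - c_y ** 2)
r = (cross_by_rows / n - c_x * c_y) / (sigma_x * sigma_y)

print(n, s_x, cross_by_rows, cross_by_cols)
print(round(r, 3))
```

Agreement of the two cross-product totals is the check the form provides against slips in tabulation.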


B. THE LINE OF REGRESSION


Indices of correlation show the extent to which the data appear 
to indicate a definite functional relationship between the paired 
variables. If there is evidence that such a relationship exists, 
the determination of the most probable relationship constitutes a 
second phase of the correlation analysis. It is this phase to 
which we must now pass. 

The nature of the relationship between X and Y is suggested
by the calculation of the means of the columns, or of the rows.
Thus, if a study is made of the data shown in Table 60 and Chart
39 of the preceding chapter, one may readily draw the conclusion
that larger values of X are associated with larger values
of Y, smaller values with smaller, and that in general with every
change of 1 unit (i.e. 1 year) in X, the age of the newsboys,
there tends to be an accompanying change of approximately
15¢ in Y, their weekly earnings. This relationship between
the two may be represented, then, by a simple line.

1 An alternative form is given by L. L. Thurston: “A Data Sheet for the Pearson Correlation
Coefficient,” Journal of Educational Research, June, 1922, pp. 49-56; another by Charlier,
Vorlesungen über die Grundzüge der Mathematischen Statistik, pp. 95-96. See also the
form suggested by Yule, Introduction to the Theory of Statistics, pp. 181 et seq.

In the computation of the correlation coefficient for time series, certain forms appear to
be preferable to others owing to the fact that individual pairs of items occur only once and
consequently the problem of variable frequencies does not arise. This matter will be discussed
in connection with the analysis in Chapter XX of the correlation of time series.

When the relationship between two correlated variables appears
to take the form of a straight line, the line best expressing the
relationship between the variables may be readily obtained from the
Pearsonian coefficient. If the straight line passing through the
means of the two variables and most closely fitting the paired values
is represented by the slope form y = mx, the value of m in this
equation is given by the expression $r\,\dfrac{\sigma_y}{\sigma_x}$.¹ In other words,

$$y = r\,\frac{\sigma_y}{\sigma_x}\,x$$

is the equation of the line most closely fitting the
paired values of X and Y when the relationship between X and Y
appears to be linear in form.²
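The slope of the line of regression can be computed from r and the two standard deviations. The sketch below is illustrative: it uses the fertilizer and yield figures of Table 61 rather than the pig-iron data, and the names `slope` and `predict` are hypothetical.

```python
from math import sqrt

fertilizer = [60, 100, 150, 220, 240, 310, 350]   # X
fruit_yield = [10, 14, 12, 16, 19, 17, 20]        # Y

n = len(fertilizer)
mean_x = sum(fertilizer) / n
mean_y = sum(fruit_yield) / n
x = [v - mean_x for v in fertilizer]
y = [v - mean_y for v in fruit_yield]

sum_x2 = sum(d * d for d in x)
sum_y2 = sum(d * d for d in y)
sum_xy = sum(a * b for a, b in zip(x, y))

sigma_x = sqrt(sum_x2 / n)
sigma_y = sqrt(sum_y2 / n)
r = sum_xy / (n * sigma_x * sigma_y)

slope = r * sigma_y / sigma_x          # coefficient of regression of Y on X

def predict(X):
    """Value of Y on the regression line, in original units."""
    return mean_y + slope * (X - mean_x)

print(round(slope, 4))
print(round(predict(200), 2))
```

The line passes through the point of means, and the slope obtained from r agrees with the direct quotient of the sum of cross-products by the sum of squared x deviations.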

The nature of the line of regression may be illustrated from the 
data in Table 63. The relationship between monthly pig-iron 
tonnage and the monthly price index is given in the equation 


$$y = .687\left(\frac{4.9052}{0.5845}\right)x, \quad\text{or}\quad y = 3.37x$$


This means that for every change of a unit in X there tends to 
be a change of 3.37 units in Y; in other words a change of 
1.00 in the price index tends to be accompanied by a change of 
3,370,000 tons in monthly pig-iron production. 

1 $r\,\dfrac{\sigma_y}{\sigma_x}$ is known as the coefficient of regression of Y on X. It may be given in
the form $r\,\dfrac{\sqrt{\Sigma y^2}}{\sqrt{\Sigma x^2}}$.

2 Note the fact that this equation may be written $\dfrac{y}{\sigma_y} = r\,\dfrac{x}{\sigma_x}$. In other words, the slope of
the regression line is equal to r when the deviations of X and Y are expressed in units of their
respective standard deviations. When correlation is perfect (i.e. when r = 1), $\dfrac{y}{\sigma_y} = \dfrac{x}{\sigma_x}$.

The equation just considered expresses the relationship between
X and Y in terms of their deviations from their respective means.
It may be preferable to show the relationship in terms of the
original observations. The equation of the line of best fit for the
original observations is to be obtained from the equation already
given by substituting $(X - M_x)$ for x and $(Y - M_y)$ for y. The
equation then becomes

$$(Y - M_y) = r\,\frac{\sigma_y}{\sigma_x}\,(X - M_x).$$


Applying this equation to the data of the illustrative case, we have


(Y − 19.79) = 3.37(X − 8.64)
337 a O85 
This states the characteristic relationship between monthly pig- 
iron production and the price index in terms of the magnitudes in 
which the two are given in the original data.¹

The derivation of the regression lines from the Pearsonian
coefficients is permissible, as already noted, only when the
relationship of the variables is linear. When the relationship is
non-linear (that is, curvilinear, or “skew”) other means have to
be employed. The regression equations in this case are no longer
to be had as simple derivatives of the index of correlation. They
are to be developed instead through specific application to the
original data of the technique of curve fitting.


C. THE CORRELATION RATIO 


As already noted, the Pearsonian coefficient measures the 
extent of correlation exactly only if the relationship between X 
and Y is linear. When the relationship between X and Y is not 
linear, the Pearsonian coefficient understates the extent of corre- 
lation, giving a value less than one even when the correlation is 
perfect. Consider the simple case shown in the data of Chart
40 and Table 64.


1 The regression line considered thus far is known as the line of regression of Y on X.
This is the form in which the regression line is usually given. But if the correlation is not
perfect (if, in other words, r lies between +1 and −1), there is really a second regression
line, namely that of X on Y. The equation of this line in terms of deviations from the
means is

$$x = r\,\frac{\sigma_x}{\sigma_y}\,y$$

where $r\,\dfrac{\sigma_x}{\sigma_y}$ is the coefficient of regression of X on Y. (It follows that the correlation
coefficient is a geometric mean of the two regression coefficients.) Though the existence
of this second line of regression is to be noted, the line need not ordinarily be given. It
approaches more and more closely to the other as the degree of correlation increases until,
when correlation is perfect, the two lines of regression coalesce.
CHART 40. CORRELATIVE DISTRIBUTION OF X AND Y, CORRELATION PERFECT
BUT NON-LINEAR

(Horizontal axis: X, in inches. Vertical axis: Y, in inches.)


TABLE 64. CORRELATIVE DISTRIBUTION OF X AND Y, CORRELATION 
PERFECT BUT NON-LINEAR 


VALUE OF X 


ALL 
ia Idd yr ut iA 

I 2 tS VALUES 
ms ag” || © Oo w@ © FG & 
ty 161.0 QO © ie © Io 
8 Oi | © yuh © © 15 
4/0. = hon ROS OREO Ke) 
ir er Lae ac Oo © % © ® 
Here the relationship between X and Y is curvilinear. It is
given by the equation Y = X².

Consider, as a more complex example of non-linear correlation,
the relationship shown in Table 65 and Chart 41.


TABLE 65. RELATION BETWEEN BUSINESS PRODUCED IN 1921 AND LENGTH OF
SERVICE WITH COMPANY AMONG SALESMEN OF INSURANCE COMPANY

LENGTH OF SERVICE | NUMBER OF | ARITHMETIC MEAN OF BUSINESS | CENTERED MOVING AVERAGE OF MEANS
(in years)        | SALESMEN  | PRODUCED IN 1921 BY         | OF BUSINESS PRODUCED (average of
                  |           | SALESMEN OF GROUP           | 3 items, in thousands of dollars)
1-2               |    486    |  $ 98,811                   | 114
3-4               |    333    |   131,418                   | 123
5-6               |    179    |   142,109                   | 137
7-8               |    150    |   138,013                   | 148
9-10              |    106    |   171,136                   | 163
11-12             |     85    |   158,088                   | 173
13-14             |     35    |   177,856                   | 188
15-16             |     29    |   226,923                   | 197
17-18             |     24    |   196,875                   | 193
19-20             |     23    |   185,869                   | 181
21-22             |     12    |   147,017                   | 163
23-24             |     13    |   156,730                   | 148
25-26             |     11    |   138,682                   | 128
27-28             |     10    |    96,250                   | 110
29-30             |      6    |    79,166                   |  99
Over 30           |     23    |   110,326                   |  97

Source: An unpublished study by Dr. Grace E. Manson. 


In a case of this sort the assumption of a linear relationship 
between length of service and business produced would lead to 
erroneous results. Analysis must deal with the fact that the 
relationship between X and Y is unmistakably curvilinear. 

With curvilinear relationships, the correlation ratio, instead of
the correlation coefficient, must be used to measure the extent
of correlation. The correlation ratio of Y on X (designated $\eta_{yx}$
and read eta of Y on X) takes the form

$$\eta_{yx} = \frac{\sigma_{m_y}}{\sigma_y}$$

where $\sigma_y$ is the familiar standard deviation of the Y's, and $\sigma_{m_y}$
the standard deviation of the means of the several arrays (columns)
of Y's.¹

1 There is a corresponding correlation ratio of X on Y given by the expression $\eta_{xy} = \dfrac{\sigma_{m_x}}{\sigma_x}$,
where $\sigma_{m_x}$ is the standard deviation of the means of the several arrays (rows) of X's. As a
rule, the two ratios are close together, and only one need be computed. The ratio for Y
on X is the one ordinarily chosen.
CHART 41. RELATION BETWEEN BUSINESS PRODUCED IN 1921 AND LENGTH OF
SERVICE WITH COMPANY AMONG SALESMEN OF INSURANCE COMPANY

(Vertical axis: thousands of dollars. Curves shown: individual group averages; 3-group moving average.)
The only new line of computation involved in obtaining a
correlation ratio is that necessary for obtaining $\sigma_{m_y}$ (or $\sigma_{m_x}$).
The steps to be taken are relatively simple. First compute the
mean of each column of Y's (or row of X's). Then obtain the
deviation of each of these means from the mean of all the Y's
(or of all the X's). Next square these deviations and multiply
each by the number of items in the given array. Finally, sum
all these products, divide by N, and extract the square root.
This result is $\sigma_{m_y}$ (or $\sigma_{m_x}$), the numerator of the correlation ratio.¹

1 The formula for $\eta_{yx}$ may be given then as

$$\eta_{yx} = \frac{\sqrt{\dfrac{\Sigma\left[N_x (M_{yx} - M_y)^2\right]}{N}}}{\sigma_y}$$

where $N_x$ is the number of items in the column and $M_{yx}$ is the arithmetic mean of the Y's
in the column.
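The steps just listed can be followed in a short sketch. The data are hypothetical: forty-five pairs lying exactly on the curve Y = X², with made-up frequencies, chosen to show a perfect but non-linear relationship. The function names are illustrative.

```python
from collections import defaultdict
from math import sqrt

def correlation_ratio(xs, ys):
    """Correlation ratio of Y on X: SD of the column means divided by SD of Y."""
    n = len(ys)
    mean_y = sum(ys) / n
    sigma_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    columns = defaultdict(list)
    for x, y in zip(xs, ys):
        columns[x].append(y)                     # one array of Y's per value of X
    # Squared deviation of each column mean, weighted by the column frequency.
    weighted = sum(len(col) * (sum(col) / len(col) - mean_y) ** 2
                   for col in columns.values())
    sigma_my = sqrt(weighted / n)
    return sigma_my / sigma_y

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical frequencies 5, 10, 15, 10, 5 for X = 1, ..., 5, with Y = X squared.
xs = [1] * 5 + [2] * 10 + [3] * 15 + [4] * 10 + [5] * 5
ys = [x * x for x in xs]

eta = correlation_ratio(xs, ys)
r = pearson_r(xs, ys)
print(round(eta, 3), round(r, 3))
```

In this hypothetical case the ratio registers perfect correlation while the coefficient falls short of one, which is the understatement described for non-linear relationships.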


Actual computation of a correlation ratio is readily effected
by a convenient addition to the computation form already given
for the correlation coefficient. The nature of the computation
will be evident from the illustrative case shown in Table 66.


TABLE 66. CALCULATION OF CORRELATION COEFFICIENT AND RATIO FROM
CORRELATION TABLE: SHORT-CUT METHOD

(Monthly Price and Production of Pig Iron)

Columns: classes of the monthly price index (x′ = deviation from arbitrary origin, in intervals).
Rows: classes of monthly pig-iron tonnage, in millions of tons (y′ = deviation from arbitrary origin, in intervals).

y′          | 7.50 to 8.00 | 8.00 to 8.50 | 8.50 to 9.00 | 9.00 to 9.50 | 9.50 to 10.00 | f_y | f_y y′ | f_y (y′)² | S_x′ | S_x′ y′ | (S_x′)² | (S_x′)²/f_y
3           |              |              |              |      4       |       4       |   8 |   24   |    72     |  12  |   36    |   144   |   18.00
2           |              |              |      4       |     16       |       1       |  21 |   42   |    84     |  18  |   36    |   324   |   15.43
1           |              |      9       |     26       |      7       |       5       |  47 |   47   |    47     |   8  |    8    |    64   |    1.36
0           |      8       |     21       |     14       |      3       |       1       |  47 |    0   |     0     | −32  |    0    |  1024   |   21.79
−1          |     10       |      8       |      2       |      2       |       1       |  23 |  −23   |    23     | −24  |   24    |   576   |   25.04
−2          |      7       |      3       |              |              |               |  10 |  −20   |    40     | −17  |   34    |   289   |   28.90
f_x         |     25       |     41       |     46       |     32       |      12       | 156 |   70   |   266     |      |  138    |         |  110.52
x′          |     −2       |     −1       |      0       |      1       |       2       |
f_x x′      |    −50       |    −41       |      0       |     32       |      24       | −35
f_x (x′)²   |    100       |     41       |      0       |     32       |      48       | 221
S_y′        |    −24       |     −5       |     32       |     49       |      18       |
S_y′ x′     |     48       |      5       |      0       |     49       |      36       | 138
(S_y′)²     |    576       |     25       |   1024       |   2401       |     324       |
(S_y′)²/f_x |  23.04       |    .61       |  22.26       |  75.03       |   27.00       | 147.94

c_y = 70/156 = .4487 (intervals)
σ_y² = 266/156 − c_y² = 1.5038
σ_y = 1.2263 (intervals)
c_x = −35/156 = −.2244 (intervals)
σ_x² = 221/156 − c_x² = 1.3663
σ_x = 1.1689 (intervals)
Σ(x′y′)/N = 138/156 = .8846 (intervals)
r = .687

Σ[(S_x′)²/f_y]/N = 110.52/156 = .7085
σ²_mx = Σ[(S_x′)²/f_y]/N − c_x²
η_xy = σ_mx/σ_x = .694

Σ[(S_y′)²/f_x]/N = 147.94/156 = .9483
σ²_my = Σ[(S_y′)²/f_x]/N − c_y²
η_yx = σ_my/σ_y = .705

Source: A. R. Crathorne, “Calculation of the Correlation Ratio,” Journal of American
Statistical Association, Sept., 1922, pp. 394-396.

This computation form has the advantage of covering both the
correlation coefficient and the correlation ratio. The difference
between these two is a test of the linearity of the relationship of
the correlated variables.¹ When it is doubtful whether the
relationship of the variables is truly linear, computation of the
correlation ratio is a necessary part of the correlation analysis.


D. CORRELATION FROM RANK DATA 


The question of correlation is sometimes raised with regard
to variables for which no information is available save their
relative ranks. The data of Table 67 may be considered by way of
illustration. The problem in this case is to measure the extent
to which the positions of the individual states in the two rank
lists give evidence of a correlation between per capita (a) property
values and (b) automobile registrations.²

The formula to be used in this case is as follows:

$$\rho = 1 - \frac{6\,\Sigma(R_x - R_y)^2}{N(N^2 - 1)}$$

where $R_x$ and $R_y$ are the ranks of the individual cases in the
two rank lists. The value of ρ, like the value of r, is confined
within the limits of +1 and −1. It is closely related to the
value of r, and serves as a satisfactory measure of the extent of
correlation when only rank data are available.³
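The rank-difference computation can be sketched as follows. The five pairs of scores are hypothetical, and the helper names `average_ranks` and `rank_difference_rho` are illustrative. Tied items receive the average of the positions they would otherwise occupy.

```python
def average_ranks(values):
    """Rank from largest (rank 1) to smallest; ties share the average position."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1          # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def rank_difference_rho(rank_x, rank_y):
    """rho = 1 - 6 * sum of squared rank differences / (N (N^2 - 1))."""
    n = len(rank_x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical scores for five cases on two variables.
first = [10, 8, 8, 5, 1]
second = [9, 7, 8, 2, 4]

rank_first = average_ranks(first)
rank_second = average_ranks(second)
rho = rank_difference_rho(rank_first, rank_second)

print(rank_first, rank_second)
print(rho)
```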

The calculation of ρ may be illustrated from the data of Table 67.
The steps taken will be evident from an examination of Table 68.
Published tables giving the value of ρ for different values of
$\Sigma(R_x - R_y)^2$ and N facilitate the work of computation and should
be consulted if any considerable number of computations are to
be undertaken.


Having calculated a given measure of correlation, the question 
arises: What significance is to be attached to a measure of the 


1 When the relationship is linear, the values of r and η are identical.
2 The method involved here is commonly referred to as the rank-difference method.
3 The relationship between ρ and r is actually given by the expression

$$r = 2 \sin\left(\frac{\pi}{6}\,\rho\right)$$

Through the use of a table such as is given in Appendix H, the corresponding values of r
and ρ may easily be had. As a matter of fact, however, the maximum difference between
the two is .018. The difference, therefore, may ordinarily be ignored.
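The conversion from ρ to r, and the size of the gap between them, can be checked numerically. The sketch below is illustrative; the function name `r_from_rho` and the grid of one thousand steps are assumptions made for the check.

```python
from math import pi, sin

def r_from_rho(rho):
    """Value of r corresponding to a rank-difference coefficient rho."""
    return 2 * sin(pi * rho / 6)

# Scan rho from 0 to 1 in steps of .001 and record the largest excess of r over rho.
largest_gap = max(r_from_rho(k / 1000) - k / 1000 for k in range(1001))

print(round(r_from_rho(0.5), 4))
print(round(largest_gap, 3))
```

The two measures coincide at 0 and at 1, and the gap between them is greatest for values of ρ a little above one-half.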
TABLE 67. RANK OF THE 48 STATES IN 1922, IN PER CAPITA (a) PROPERTY
VALUES AND (b) AUTOMOBILE REGISTRATIONS

RANK IN PER CAPITA PROPERTY VALUES | RANK IN PER CAPITA REGISTRATIONS
Nevada          |  1 | California      |  1
Wyoming         |  2 | Iowa            |  2
South Dakota    |  3 | Nebraska        |  3
Iowa            |  4 | South Dakota    |  4
Oregon          |  5 | Kansas          |  5
California      |  6 | Colorado        |  6
Nebraska        |  7 | Oregon          |  7
North Dakota    |  8 | Indiana         |  8
Montana         |  9 | Minnesota       |  9
Connecticut     | 10 | Michigan        | 10.5*
Washington      | 11 | Wyoming         | 10.5*
New Jersey      | 12 | Nevada          | 12
Arizona         | 13 | Washington      | 13
Kansas          | 14 | North Dakota    | 14
Minnesota       | 15 | Ohio            | 15
New York        | 16 | Wisconsin       | 16
Idaho           | 17 | Idaho           | 17
Illinois        | 18 | Vermont         | 18
Colorado        | 19 | Oklahoma        | 19
Utah            | 20 | Illinois        | 20
Massachusetts   | 21 | Florida         | 21.5*
Pennsylvania    | 22 | Maine           | 21.5*
Rhode Island    | 23 | Missouri        | 23
New Hampshire   | 24 | Montana         | 25*
Ohio            | 25 | Maryland        | 25*
West Virginia   | 26 | Arizona         | 25*
Indiana         | 27 | Texas           | 27
Missouri        | 28 | Connecticut     | 28
Michigan        | 29 | Delaware        | 29.5*
Wisconsin       | 30 | Rhode Island    | 29.5*
Delaware        | 31 | New Hampshire   | 32*
Maryland        | 32 | New Jersey      | 32*
Maine           | 33 | Utah            | 32*
Vermont         | 34 | Massachusetts   | 34
Florida         | 35 | New York        | 35
New Mexico      | 36 | Pennsylvania    | 36
Virginia        | 37 | West Virginia   | 37
Texas           | 38 | Virginia        | 38
Oklahoma        | 39 | New Mexico      | 39.5*
Louisiana       | 40 | North Carolina  | 39.5*
Tennessee       | 41 | Kentucky        | 41
North Carolina  | 42 | Tennessee       | 42
Kentucky        | 43 | Louisiana       | 43.5*
Arkansas        | 44 | South Carolina  | 43.5*
South Carolina  | 45 | Arkansas        | 45
Georgia         | 46 | Georgia         | 46
Alabama         | 47 | Mississippi     | 47
Mississippi     | 48 | Alabama         | 48

* Where two or more states are tied for a given rank, they are assigned a rank equal to the
average of the positions they would have were they not exactly tied.
TABLE 68. CALCULATION OF ρ FROM RANK DATA

(Rank of 48 States in 1922 in Per Capita (a) Property Values and (b) Car Registrations)

STATE           | RANK IN PROPERTY | RANK IN AUTOMOBILE  | DIFFERENCE BETWEEN RANKS
                | VALUES, R_x      | REGISTRATIONS, R_y  | (R_x − R_y) | (R_x − R_y)²
                | (a)              | (b)                 | (c)         | (d)
Nevada          |  1  | 12    | −11    | 121.
Wyoming         |  2  | 10.5  |  −8.5  |  72.25
South Dakota    |  3  |  4    |  −1    |   1.
Iowa            |  4  |  2    |   2    |   4.
Oregon          |  5  |  7    |  −2    |   4.
California      |  6  |  1    |   5    |  25.
Nebraska        |  7  |  3    |   4    |  16.
North Dakota    |  8  | 14    |  −6    |  36.
Montana         |  9  | 25    | −16    | 256.
Connecticut     | 10  | 28    | −18    | 324.
Washington      | 11  | 13    |  −2    |   4.
New Jersey      | 12  | 32    | −20    | 400.
Arizona         | 13  | 25    | −12    | 144.
Kansas          | 14  |  5    |   9    |  81.
Minnesota       | 15  |  9    |   6    |  36.
New York        | 16  | 35    | −19    | 361.
Idaho           | 17  | 17    |   0    |   0.
Illinois        | 18  | 20    |  −2    |   4.
Colorado        | 19  |  6    |  13    | 169.
Utah            | 20  | 32    | −12    | 144.
Massachusetts   | 21  | 34    | −13    | 169.
Pennsylvania    | 22  | 36    | −14    | 196.
Rhode Island    | 23  | 29.5  |  −6.5  |  42.25
New Hampshire   | 24  | 32    |  −8    |  64.
Ohio            | 25  | 15    |  10    | 100.
West Virginia   | 26  | 37    | −11    | 121.
Indiana         | 27  |  8    |  19    | 361.
Missouri        | 28  | 23    |   5    |  25.
Michigan        | 29  | 10.5  |  19.5  | 380.25
Wisconsin       | 30  | 16    |  14    | 196.
Delaware        | 31  | 29.5  |   1.5  |   2.25
Maryland        | 32  | 25    |   7    |  49.
Maine           | 33  | 21.5  |  11.5  | 132.25
Vermont         | 34  | 18    |  16    | 256.
Florida         | 35  | 21.5  |  13.5  | 182.25
New Mexico      | 36  | 39.5  |  −3.5  |  12.25
Virginia        | 37  | 38    |  −1    |   1.
Texas           | 38  | 27    |  11    | 121.
Oklahoma        | 39  | 19    |  20    | 400.
Louisiana       | 40  | 43.5  |  −3.5  |  12.25
Tennessee       | 41  | 42    |  −1    |   1.
North Carolina  | 42  | 39.5  |   2.5  |   6.25
Kentucky        | 43  | 41    |   2    |   4.
Arkansas        | 44  | 45    |  −1    |   1.
South Carolina  | 45  | 43.5  |   1.5  |   2.25
Georgia         | 46  | 46    |   0    |   0.
Alabama         | 47  | 48    |  −1    |   1.
Mississippi     | 48  | 47    |   1    |   1.
Total           |     |       |        | 5041.50

$$\rho = 1 - \frac{6\,\Sigma(R_x - R_y)^2}{N(N^2 - 1)} = 1 - \frac{6(5041.50)}{110544} = .726$$



particular magnitude secured?¹ Correlation coefficients of + 1
and − 1 are clearly enough indicative of perfect correlation;
and a coefficient of 0 is convincing proof of complete independence.
Actual coefficients based upon empirical data, however,
never exhibit these limiting values. True, in the case of carefully
controlled laboratory experiments, correlation may closely
approximate + 1 or − 1, since the only disturbing factors here
are factors giving rise to unavoidable error in the manipulations
or observations. But the complex uncontrolled factors
of the economic and business world rarely show very high correlation
coefficients. Analysis is consequently faced with the serious
problem of interpreting coefficients which fall in all conceivable
positions between + 1 and 0.
It is not feasible in an elementary text to deal comprehensively
with the difficult problem of interpreting statistical coefficients
in the light of the general theory of probability. About all that
can be done is to state certain general conclusions. Coefficients
above .70 give almost certain evidence of correlation, and any
above .50 are ordinarily significant; coefficients under .30 give
very little indication of any definite connection between the
variables. Unmistakable correlation at one extreme shades off
by imperceptible gradations into unmistakable independence at
the other. In most cases the coefficient assumes intermediate
values, and conclusions as to the extent of correlation have to be
stated with great caution.³


¹ The measurement of correlation when more than two factors are included in each
“pairing” is not undertaken in the present study. This phase of the correlation analysis,
commonly referred to as the phase of “multiple correlation,” involves technical problems
which at this time hardly belong in an introductory treatise. Helpful explanations of the
subject may be found in Kelley, Statistical Method, chap. XI; Mills, Statistical Methods,
chap. XIV; and Yule, Introduction to the Theory of Statistics, chap. XII.

² It is often stated that one may be practically certain that the correlation coefficient is
indicative of a persistent connection between the paired variables if the value of r is as
great as six times its probable error. The probable error of the coefficient is given by the
expression $0.6745 \left( \frac{1 - r^2}{\sqrt{N}} \right)$, where N is the number of items in each of the paired series.
But there are serious limitations upon the applicability of the concept of the probable error.
Some of these limitations will be considered in a later connection. (See Chapter XXIV.)
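The rule of thumb in this footnote can be tried numerically. The sketch below assumes the probable error takes the form 0.6745(1 − r²)/√N and applies the six-times test; the function names and the sample figures (r = .50 from 48 pairs) are hypothetical, chosen only for illustration.

```python
import math


def probable_error_of_r(r: float, n: int) -> float:
    """Probable error of a correlation coefficient: 0.6745 * (1 - r^2) / sqrt(N)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    if n < 1:
        raise ValueError("N must be a positive number of paired items")
    return 0.6745 * (1.0 - r * r) / math.sqrt(n)


def meets_six_times_rule(r: float, n: int) -> bool:
    """True when the size of r is at least six times its probable error."""
    return abs(r) >= 6.0 * probable_error_of_r(r, n)


# Hypothetical figures, not taken from the text: r = .50 from 48 pairs.
pe = probable_error_of_r(0.50, 48)
print(round(pe, 4), meets_six_times_rule(0.50, 48))
```

The same coefficient passes or fails the test according to the number of pairs, which is one reason the footnote's caution about the probable error deserves attention.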

³ It should be noted that values of the indices of correlation are not to be thought of as
percentages. A coefficient of .74 cannot be regarded as indicative of twice as much correlation
as is shown by a coefficient of .37!

Further discussion of the interpretation of correlation coefficients
will have to be deferred to a later chapter on the general
subject of the interpretation of statistical results. There are
few more difficult tasks for the statistician than the evaluation
of correlation coefficients. Distinctions between real and
spurious correlation have to be carefully drawn. Enough has
already been said, however, to indicate the general nature of
correlation and the means by which the extent of correlation in
paired variables is to be initially determined.


ANALYSIS OF SPATIAL SERIES


CHAPTER XIV


SPATIAL SERIES AND THEIR GRAPHIC COMPARISON 


All series express differences in one variable with reference to
differences in another. Spatial series express differences in one
variable with reference to differences of space, or geographic
location, serving as the other, or independent, variable.
Series of this type present special problems in statistical analysis,
and call for separate treatment.


A. THE NATURE AND FORMULATION OF SPATIAL SERIES


One variety of spatial series — the spatial distribution — has 
already been considered (see Chapter VII). In spatial distribu- 
tions the individuals belonging to some larger group are spread 
out or classified according to their geographic positions. Dif- 
ferences among the classes in this case are obviously spatial in 
character. The values of the dependent variable, on the other 
hand, express frequencies of occurrence in the designated areas. 
These frequencies are strictly additive; they may be summed to 
obtain a significant aggregate pertaining to the total area covered 
by the distribution. 

But not all spatial series are distributive in character. The 
price of steel ingots at fifty different points in the United States 
may be set up as a spatial series. Such a series is not a distribu- 
tion in any sense of the word. Similarly, an orderly statement 
of yields per acre of corn by counties is a spatial series, but not a 
spatial distribution. The latter is simply one form of a more 
inclusive type of statistical series, namely, the spatial series. 

Several varieties of spatial series may be distinguished. In 
addition to the spatial distribution already considered, there is 
the series exhibiting the spatial variation of an individual element. 
Atmospheric pressure at different points is an example; the price
of a single commodity in different markets, such as the illustration
cited above, is another. Other spatial series show averages
over designated areas; still others, rates and ratios. The
form of the dependent variable may vary widely in different 
spatial series, but the independent variable is always space, or 
geographic location. 

The steps involved in the formation of a spatial series have 
been considered already in the study of the spatial distribution. 
Nothing need be added here to the directions already given in 
that connection for the selection of the intervals (or areas), the 
determination of the class limits, and the choice of the class 
designations. The rules which govern the set-up of the spatial 
distribution are to be recognized in the formulation of other 
varieties of spatial series as well. 

One special point, however, is to be noted: the difficulty of 
unequal class intervals, so serious in the analysis of the spatial 
distribution, is less troublesome when the series is not distributive 
in character. In all distributions the number of individuals 
falling within each of the several classes depends in part on the 
size of the classes. If this size differs from class to class, impor- 
tant characteristics of the variable are obscured; the distribution 
is defective on this account. But if the dependent variable is 
of such a character as to be little, if at all, affected by differences 
in the size of the class interval (or area), this defect is reduced to 
negligible proportions. 

When unequal classes are encountered in spatial distributions, 
it is not infrequently desirable, therefore, to reduce the series to a 
somewhat different form. For example, the varying population 
of the different states of the Union is in considerable measure the 
consequence of their widely varying areas. Certainly, these 
varying areas have to be taken into account in many connections 
in interpreting the differences in the state populations. But if, 
instead of the aggregate population by states, the population 
per square mile by states is taken, the variable in this form 
obviously makes full allowance for differences of area, and is 
independent of the divergent sizes of the several states. The
series has ceased to be a spatial distribution of an aggregate
population, and is in the form of a derived variable, namely
density of population, which is largely independent of the
differing areas of the individual states.
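The conversion of an aggregate into a relative may be sketched in a few lines. The figures below are hypothetical, not census data, and the function name `per_unit` is an assumption made for illustration; the point is only that the ordering of the areas can change once area is allowed for.

```python
def per_unit(aggregates: dict, bases: dict) -> dict:
    """Divide each area's aggregate by its base (e.g. population by square miles)."""
    return {area: aggregates[area] / bases[area] for area in aggregates}


# Hypothetical figures for three unnamed areas; they are not census data.
population = {"Area A": 1_200_000, "Area B": 300_000, "Area C": 2_500_000}
square_miles = {"Area A": 60_000, "Area B": 1_200, "Area C": 100_000}

density = per_unit(population, square_miles)
for area in sorted(density, key=density.get, reverse=True):
    print(area, round(density[area], 1))
```

In this illustration the area with the smallest aggregate population has by far the greatest density, which is the sort of difference the aggregate series conceals.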


MONTHLY MEAN OF FREE AIR TEMPERATURES, JULY 1, 1907, TO JUNE 30, 1910

(Mount Weather Observatory)

TABLE 69

ALTITUDE OVER SEA LEVEL    TEMPERATURE
(in feet)                  (in degrees Centigrade)

 500                        15.6
 750                        14.0
1000                        [illegible]
1250                        10.6
1500                         9.0
1750                         7.4
2000                         5.8
2250                         4.3
2500                         2.9
2750                         1.5
3000                         0.2
3250                        [illegible]
3500                        [illegible]
3750                        −3.9
4000                        [illegible]
4250                        −6.6
4500                        [illegible]
4750                        −9.6
5000                        [illegible]
5250                        [illegible]
5500                       −14.3
5750                       −15.8
6000                        [illegible]
6250                       −18.8
6500                        [illegible]
6750                       −21.0
7000                       −22.9
7250                       −24.2

CHART 42

[Line chart of the data of Table 69. Axis label: Temperature, degrees Centigrade.]

Source: Bulletin of the Mount Weather Observatory, vol. IV, pt. 2, p. 32.


It is not necessary in such a transformation that the basis of
the relative be a unit of area like the square mile. It may be
thousands of population. If one were to give by states the
number of inhabitants residing in urban centers, these aggregates
would be in some measure affected by the areas of the state and 
more directly by the number of cities within the state limits. 
If the variable is made the percentage of total population in the 
state living in urban centers, the differences of area among the 
different states do not play so important a part. Such con- 
version of spatial distributions into spatial series of relative items 
may form an important part of the analysis of spatial variation. 


CHART 43 
FEDERAL RESERVE PROSPERITY 


(Earnings of the Federal Reserve Member Banks Expressed in Average Per- 
centage by Districts) 




Source: The New Republic, Dec. 26, 1923, p. 10. 


Spatial variation — as previously noted — may relate to dif- 
ferences along a line or over an area. Variation along a line is a 
relatively simple form. Table 69 and Chart 42 give a concrete 
illustration. Cases of this sort call for no extended consideration. 
Analysis proceeds along lines that are fairly evident. The 
further discussion of spatial variation may be profitably confined 
to the areal, rather than the linear, form.


B. GRADED MAPS


The difficulties of effective presentation — noted in the con- 
sideration of spatial distributions — are encountered with all 
varieties of spatial series exhibiting variation over areas. No 
simple textual or tabular treatment succeeds in making plain the 
character of the variation. Some form of statistical map needs 
to be brought into service. Maps for the presentation of spatial 
distributions have already been considered; maps for the presentation
of other varieties of statistical series assume somewhat
different form, and call for special consideration.¹

¹ Of course, the use of the outline map with numerical data directly inserted is permissible
with non-additive variables as well as additive. Chart 43 affords an interesting illustration.
This form, however, is not especially effective. Moreover, it is so easily made as to require
no further comment.

CHART 44

PERCENTAGE OF ILLITERATES IN THE POPULATION 10 YEARS OF AGE
AND OVER, 1910

[Shaded map. Legible legend entries: 16 to 25 per cent; 25 per cent and over. The heavy lines show geographic divisions.]

Source: Thirteenth Census of the United States.

For statistical series which are not distributive in character,
the appropriate graphic device is the graded map. An illustration
of this type of map is given in Chart 44. Here the differing
magnitudes of the variable (percentage of illiterates) are
cast into a limited number of classes: seven, to be exact.
These classes are assigned shades of black and white and the
several subdivisions of the map then covered with the shades
which correspond to the magnitudes of the variable in these
areas.¹

A number of distinct steps are involved in the construction of a 
graded map. First, the several grades into which the magnitudes 
of the variable are to be cast must be determined. Second, the 
shades of black and white, or of color, by which these grades are 
to be represented must be fixed. Third, the several divisions of 
the map must be appropriately covered with these shades. The 
details of these steps may be considered separately. 


¹ Graded maps are sometimes used to represent situations in which there is no real variation,
i.e. no quantitative differences. Chart 45 is an illustration. In a case such as this,
it is preferable not to use gradations of black and white, or of any single color. The
differences exhibited are “of kind” not “of degree” and should be represented by coverings
which imply distinctions but not gradations. (See Chart 46, section A, below.)

CHART 45

THE DISTRIBUTION OF NATIVE TIMBER IN THE UNITED STATES AT THE TIME OF
SETTLEMENT

(The line running from Portland, Maine, to Puget Sound shows the successive migrations
of the lumber market during the past century)

[Map. Legible legend entries: Northern and western conifers, pines, spruces, firs; Central hardwood; Southern hard pines.]

Source: Philip W. Ayres, “Tackling the Forestry Problem in Time,” American Review of
Reviews, July, 1922.

The principal question to be considered in setting up the grades
(or classes) of the dependent variable concerns their number. 
In general, the greater the number of classes, the greater the 
accuracy with which the differences of the variable can be repre- 
sented. On the other hand, the greater the number of grades, 
the greater are the difficulties of obtaining clearly distinguishable 
shades of black and white, or of color. As many as ten clearly 
graded shades are obtainable, but this appears to be practically 
the limit. Ordinarily, six to eight grades are as many as can be 
safely attempted. 

Whatever the number finally adopted, the grades should 
never overlap and should be made to correspond to equal intervals 
of the range of the variable. Not infrequently, a scale is adopted 
which assigns to one grade a range two or three times as great 
as that represented by some other. The resulting map is always 
misleading in greater or less degree. Equality of interval is a 
virtue here as elsewhere in statistical analysis. 
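The two requirements just stated, that grades never overlap and that they cover equal intervals of the range, can be made concrete in a brief sketch. The scale used (0 to 35 per cent in seven grades) and the function names are hypothetical; they are not the classes of any chart in this chapter.

```python
def equal_interval_grades(low: float, high: float, count: int):
    """Split the range [low, high] into `count` equal, non-overlapping grades.

    Returns a list of (lower, upper) limits; each grade includes its lower
    limit and excludes its upper limit, except the last, which includes both.
    """
    if count < 1 or high <= low:
        raise ValueError("need count >= 1 and high > low")
    width = (high - low) / count
    return [(low + i * width, low + (i + 1) * width) for i in range(count)]


def grade_of(value: float, low: float, high: float, count: int) -> int:
    """Return the 1-based grade into which `value` falls."""
    if not low <= value <= high:
        raise ValueError("value lies outside the graded range")
    width = (high - low) / count
    return min(int((value - low) // width) + 1, count)


# Hypothetical scale: percentages from 0 to 35 cast into seven grades of 5 points.
print(equal_interval_grades(0, 35, 7))
print(grade_of(12.0, 0, 35, 7), grade_of(35.0, 0, 35, 7))
```

Because a value lying exactly on a class limit is always assigned to the higher grade (except at the top of the range), no area can fall into two grades at once.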

Once the grading scale has been adopted, the problem of fixing 
the corresponding shades of black and white, or of color, is met. 
If the map is to be constructed in color, even gradations of in- 
tensity may be obtained with fair accuracy by repeated applica- 
tions of a uniform light wash. When care is taken to allow each 
wash to dry thoroughly before applying the next, it will be found 
that the intensity of the final shades will be proportionate to the 
number of washes applied. More elaborate maps are sometimes 
made in two colors, increasing intensity of the one representing 
divergence from the mean in one direction, and of the other,
divergence from the mean in the opposite direction, the mean 
being represented by absence of color. If only an original copy 
is desired, color maps are to be viewed with favor since they make 
a strong appeal and are undoubtedly most effective. When, 
upon the other hand, reproductions may be necessary, it is pref- 
erable to make the map originally in black and white, since 
reproduction in color is almost prohibitively expensive, whereas 
photographic reproduction of black and white figures is extremely 
economical.

If black and white alone are to be used in the map, varieties
of dotting and cross-hatching are the means ordinarily employed. 
Considerable experimentation will be found necessary to obtain 
satisfactory results. In Chart 46, section B, twenty-five dif- 
ferent effects are shown, ranging from clear white at one extreme 
to solid black at the other. By careful selection from such a 
collection of linings, shades may be obtained which convey clearly 
the impression of even gradations in the magnitudes of the vari- 
able. 

The final step in the making of the graded map is the applica- 
tion of the selected grades of black and white, or color, to the 
several subdivisions of the total area.² This requires little more
than technical skill with pen or brush. It should be noted, 
however, that if considerable areas are to be cross-hatched, certain 
mechanical devices can be used to good advantage. Cross-hatch- 
ing triangles, and other more elaborate devices, assure an even- 
ness of effect which is quite unattainable without mechanical aid. 


C. GRAPHIC COMPARISON OF SPATIAL SERIES 


Graphic devices play a more important part in the study of 
spatial series than in the analysis of any other type of statistical 
material. This follows from the fact that the character of spatial 
variation over considerable areas cannot be effectively shown 
except by means of the statistical map. Graphic forms are a 
material aid in a great many phases of analysis; in the examina- 
tion of spatial series, they are an indispensable instrument of 
comprehensive analysis. 

¹ If a large number of graded maps are to be constructed and later reproduced, time and
expense may be saved by preparing special papers in the desired shades of black and white
and then cutting and pasting these so as to cover the map with the appropriate gradations.
For the details of this method, as well as for further valuable suggestions on the construction
of statistical maps, see W. Z. Ripley, “Notes on Map-Making,” Quarterly Publications of
the American Statistical Association, Sept., 1899. W. C. Brinton, Graphic Methods for
Presenting Facts, chaps. XI and XII, may also be profitably consulted.

² The finished map should always carry a key showing the exact significance of each of
the grades employed.

CHART 46. DISTINCTIONS OF SHADING IN BLACK AND WHITE

Section A: AREAL DISTINCTIONS    Section B: DENSITY DISTINCTIONS

Source: Department of Agriculture.

The importance of the statistical map in the study of spatial
variation can best be indicated through an examination of concrete
illustrations. Attention may be directed to cases in which
two or more spatial variables are graphically compared. The
simplest condition is that in which identical scales are employed in
the different maps by means of which the nature of the differences
or changes in the variable are made evident. Comparisons
in which two or more distinct scales have to be used raise
more serious questions.


CHART 47. RETREAT OF YELLOW FEVER IN THE WESTERN HEMISPHERE


Source: G. E. Vincent, The Rockefeller Foundation: A Review for 1924, p. ae 


One case of the more simple condition is the comparison 
of the same variable element at different times. The maps 
appearing in Chart 47 afford an interesting example. The two 
figures show most effectively the progress that was made from 
1900 to 1920 in reducing the areas of yellow-fever infection.
Once the necessary data are in hand no considerable problems
are encountered in the construction of maps of this sort. If the
variable has been consistently defined and uniformly recorded at
the different dates, the maps afford an admirable indication of the
changes that have taken place geographically in the values of the
variable.

CHART 48. GEOGRAPHIC DISTRIBUTION OF PEACH- AND APPLE-TREE
ACREAGE IN THE UNITED STATES, 1910

[Two dot maps of the United States. Peaches: trees of bearing and not of bearing age, approximate acreage; each dot represents 500 acres. Apples: trees of bearing age, approximate acreage; each dot represents 500 acres.]

Source: Finch and Baker, Geography of World’s Agriculture, p. 80.

A somewhat different case is presented by contrasts among like
variables at a single date of record. Consider the two maps
shown in Chart 48. Maps such as these are planned and
constructed according to principles already reviewed. No
complications are encountered, since the variables are of such
a character that a single dot may represent the same number
of units in both cases. The distributions, consequently, are
strictly comparable. Such maps serve to make perfectly clear
the contrasts that may appear among spatial series of the distributive
type.

CHART 49. GEOGRAPHIC DISTRIBUTION OF FOREIGN-BORN IN THE UNITED
STATES, 1910

[Shaded maps. Legible panel titles: 1. Germany; 2. Russia and Finland; 3. Austria-Hungary; 4. Ireland.]

Source: Statistical Atlas of the United States, 1914.

Still another case is represented in the maps of Chart 49. Here 
the variable is not distributive, and the graded, rather than the 
dot, map has to be employed. But identical gradations are 
available since the proportions of all six foreign-born groups or 
nationalities are stated in the same unit. The marked differences 
in both the extent and location of geographic concentration among 
the nationalities are thus made strikingly clear. No device 
other than the map could be used as well in this particular phase 
of analysis. Moreover, the use of no other instrument could 
present fewer difficulties. So long as the several variables find 
expression in the same unit and occupy approximately the same 
range of the scale so that the same scheme of dots or shades 
may be used in the construction of the entire set of maps, the 
task of accurate graphic comparison offers practically no difficul- 
ties save those of satisfactory technical execution of the plan. 

Complications appear, however, when the variables to be con- 
trasted are expressed in different units. The maps shown in 
Chart 50 serve as an illustration. The comparison of dot maps 
such as these raises questions concerning comparability, owing to 
the fact that the number of cases represented by a single dot 
varies from map to map. The two maps in the illustration 
appear to be successful in conveying accurate impressions of the 
differences between ‘‘where the beef comes from” and “where it 
goes.” At times, however, in similar comparisons, distorted 
images are presented through improper selection of the graphic 
unit. Where the value of one variable in terms of the other is 
definitely fixed, this relation should be observed in constructing
the maps. Where no such definite relationship exists, care should
be exercised to see that values assigned to a single dot in the
different maps result in comparable distributions.

CHART 50. GEOGRAPHIC DISTRIBUTION OF BEEF PRODUCTION AND CONSUMPTION
OF THE UNITED STATES

Source: Swift and Company, Year Book, 1922, pp. 22 and 23.

CHART 51. WARD DIFFERENCES IN THE FINANCIAL SUPPORT OF
SCHOOLS IN PITTSBURGH, 1907

Source: Shelby M. Harrison, “The Disproportion of Taxation in Pittsburgh,” Pittsburgh
Survey, pp. 176 and 178.

In the comparison of spatial series which are not distributive 
in character, unlike units present even greater difficulties. Chart 
51 affords an illustration of this sort of comparison. The two 
variables represented in these maps are in entirely different units. 
In what terms are the gradations of the variables to be stated in 
order that the final shaded maps may present accurately and fairly 
any contrast in the geographic variation? 

The simplest treatment of this problem involves a division of 
the effective ranges of the two or more variables into the same 
number of equal divisions. The maps appearing in Chart 51 
follow this practice to the extent of adopting the same number 
of gradations (10) in the treatment of the two variables repre- 
sented — assessed valuation of property per pupil, and tax 
rate for maintenance of school buildings. The intervals of 
the two scales, however, are decidedly unequal.¹ On this account,
accurate comparison of the two variables is not easy.
Effective analysis of spatial differences ordinarily requires that 
the same number of equal intervals be recognized in the 
scheme upon which the several maps of a single exhibit are con- 
structed. 

A further step may be profitably taken, at times, in the study 
of spatial differences among variables expressed in unlike units. 
This consists in relating the gradations of the map to the means 
and standard deviations of the individual variable. Suppose 
the problem is to compare by states, the proportion of Ford cars 
to all others, with the percentage of the rural to the total population.
The data are given in Table 70. The means and standard
deviations of the two variables are as follows:

                                        M      σ
1. Percentage of Fords                  49      8
2. Percentage of rural population       58     21

¹ It should be said that unequal intervals in graded maps are very common. Doubtless
in some instances they are warranted (see Chapter V, supra).

TABLE 70 A. PROPORTION OF FORD PASSENGER CARS TO TOTAL PASSENGER-
CAR REGISTRATIONS, JANUARY, 1923, BY STATES

STATE            PER CENT      STATE            PER CENT      STATE          PER CENT
Maine            [missing]     Missouri         56            Mississippi    [missing]
New Hampshire    [missing]     North Dakota     56            Arkansas       [missing]
Vermont          [missing]     South Dakota     52            Louisiana      [missing]
Massachusetts    [missing]     Nebraska         59            Oklahoma       [missing]
Rhode Island     [missing]     Kansas           [illegible]   Texas          [missing]
Connecticut      [missing]     Delaware         49            Montana        [missing]
New York         [missing]     Maryland         [illegible]   Idaho          [missing]
New Jersey       [missing]     Virginia         58            Wyoming        [missing]
Pennsylvania     [missing]     West Virginia    44            Colorado       [missing]
Ohio             [missing]     North Carolina   59            New Mexico     [missing]
Indiana          [missing]     South Carolina   56            Arizona        [missing]
Illinois         [missing]     Georgia          54            Utah           [missing]
Michigan         [missing]     Florida          45            Nevada         [missing]
Wisconsin        [missing]     Kentucky         58            Washington     [missing]
Minnesota        [missing]     Tennessee        56            Oregon         [missing]
Iowa             [missing]     Alabama          52            California     [missing]

Source: “Proportion of Fords Varies Greatly in Different States,” Automotive Industries,
September 6, 1923, pp. 481-483.


TABLE 70 B. PER CENT RURAL POPULATION TO TOTAL POPULATION,
1920, BY STATES

STATE            PER CENT      STATE            PER CENT      STATE          PER CENT
Maine            61            Missouri         [missing]     Mississippi    [missing]
New Hampshire    37            North Dakota     [missing]     Arkansas       [missing]
Vermont          69            South Dakota     [missing]     Louisiana      [missing]
Massachusetts     5            Nebraska         [missing]     Oklahoma       [missing]
Rhode Island      3            Kansas           [missing]     Texas          [missing]
Connecticut      32            Delaware         [missing]     Montana        [missing]
New York         17            Maryland         [missing]     Idaho          [missing]
New Jersey       22            Virginia         [missing]     Wyoming        [missing]
Pennsylvania     36            West Virginia    [missing]     Colorado       [missing]
Ohio             36            North Carolina   [missing]     New Mexico     [missing]
Indiana          49            South Carolina   [missing]     Arizona        [missing]
Illinois         32            Georgia          [missing]     Utah           [missing]
Michigan         30            Florida          [missing]     Nevada         [missing]
Wisconsin        53            Kentucky         [missing]     Washington     [missing]
Minnesota        56            Tennessee        [missing]     Oregon         [missing]
Iowa             64            Alabama          [missing]     California     [missing]

Source: Fourteenth Census of the United States, 1920, vol. I, pp. 46, 47.

A scheme of gradations for two shaded maps may be readily
set up on the basis of these values. Suppose seven distinct 
grades of each variable are to be recognized. The standard 
deviations of the two variables stand in the ratio of 8 to 21, or
approximately 1 to 2½. The percentage classes in the second
map (showing the percentage of rural population) should,
therefore, be approximately 2½ times as wide as in the first map
(showing the percentage of Fords). A convenient arrangement
may be had by letting the grades of the first map cover 5 per cent,
those of the second map 12½ per cent. The mean value of both
variables appears at approximately the middle of the fifth class. 
The scheme is indicated in Table 71. 


TABLE 71. COMPARABLE GRADES FOR TWO SHADED MAPS

                              PERCENTAGE CLASS (designated by mid-value)
GRADE ON MAP                  FORD CARS OF       RURAL POPULATION OF
(beginning with lightest)     ALL CARS           TOTAL POPULATION

1                             30                  7½
2                             35                 20
3                             40                 32½
4                             45                 45
5                             50                 57½
6                             55                 70
7                             60                 82½

With scales of this sort in hand, graded maps are readily 
constructed whether drawn in color or in black and white. By 
using one color for deviations on one side of the mean and another 
color for deviations on the other side, strikingly effective results 
can be obtained. Maps constructed on this plan of adjusted 
classes dispose definitely of questions regarding the comparabil- 
ity of the grades in which the different variables are represented. 

Comparison of spatial series by means of statistical maps 
generally has the purpose of discovering significant relationships. 
These relationships may be direct, or may result from some under- 
lying factor not specifically indicated in the variables represented. 
Sometimes the form of the series suggests what this underlying 
factor is. 
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For example, there are basic physiographic factors which 
determine the spatial variation of many other phenomena. 
Among agricultural elements, temperature and rainfall are such 
factors. In the maps of Chart 52 the geographic variation of the 


CHART 52. GROWING SEASON AND ANNUAL PRECIPITATION IN THE UNITED 
STATES 
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Source: Baker and Strong, ‘‘ Arable Land in the United States” (Separate No. 771, from 
the Yearbook of the Department of Agriculture, 1918). 
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average length of the growing season and the average annual 
precipitation is shown for the United States. Familiarity with 
the spatial characteristics of these factors enables one to detect 
at once some of the influences determining the geographic features 
of various economic and social phenomena. Likewise, familiarity 
with other characteristic forms enables the statistical analyst to 
determine the lines along which further investigation may be most 
profitably undertaken. 

When the form of two or more spatial series is like or opposite 
in marked degree, the relationship between the variables may be 
further examined through the correlation analysis. The form 
of the series is ignored in this case, attention being focussed upon 
the relationships disclosed in the paired observations. Pairing 
here is spatial; that is, an observation of one variable is linked to 
an observation of the other variable by its occurrence at the same 
place. From this point on, analysis concerns itself with the 
magnitudes of the variables and not with the geographic location 
of the cases. The technique of analysis becomes practically 
identical with that previously considered for variables unrelated 
to time and space. 
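Spatial pairing, as described above, amounts to matching the two series on place and then discarding the place names. A minimal sketch follows, assuming the ordinary product-moment coefficient is the measure applied to the pairs; the place labels, figures, and function names are hypothetical.

```python
import math


def pair_by_place(series_a: dict, series_b: dict):
    """Pair two spatial series on the places they have in common."""
    places = sorted(set(series_a) & set(series_b))
    return [(series_a[p], series_b[p]) for p in places]


def correlation(pairs):
    """Product-moment coefficient of correlation for a list of (x, y) pairs."""
    n = len(pairs)
    if n < 2:
        raise ValueError("at least two pairs are needed")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        raise ValueError("a variable with no variation cannot be correlated")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical series; place "P" and place "T" each appear in only one of them.
first = {"P": 1, "Q": 2, "R": 3, "S": 4}
second = {"Q": 4, "R": 6, "S": 8, "T": 10}

pairs = pair_by_place(first, second)
print(pairs, correlation(pairs))
```

Places recorded in only one of the two series drop out at the pairing step, so the number of pairs may be smaller than the length of either series.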

From what has been stated, it is evident that the study of 
spatial series follows a number of distinct lines. Attention may 
be focussed in the first place on the way in which the individuals 
of some large mass are dispersed geographically. In some 
connections a single spatial distribution is of major significance. 
Commonly, however, the analysis includes two or more variables. 
These may be consolidated in some compound variable, the 
spatial differences of which are directly examined; or the individual
variables may be examined separately and their spatial
forms compared. By this means significant relationships may 
be disclosed, and promising lines indicated for correlation analysis. 
In general, the study of spatial series is a special phase of statistical
analysis repaying careful cultivation in many lines of investigation.


CHAPTER XV


THE NATURE OF TIME SERIES 


Perhaps the most interesting and significant variety of series
in economic and business research is the time series. A statistical
series has been defined as an orderly set of numerical items
expressing the relationship between two variables. In a time
series, one of the two variables (the independent) is time;
the individual magnitudes of the other variable (the dependent)
relate to different points or intervals of the independent
variable, time. If the series is in customary form, the successive
items are consecutive in time. The general character of time 
series is thus easily recognized. 

Time series are, in fact, a very familiar form. A number of 
concrete examples will serve to illustrate the type. 


TABLE 72. TONS OF CARGO ON COMMERCIAL VESSELS 
PASSING THROUGH THE PANAMA CANAL 


By Fiscal Years, 1915-1922 


YEAR      TONS OF CARGO 
1915       4,888,454 
1916       3,094,114 
1917       7,058,563 
1918       7,532,031 
1919       6,916,621 
1920       9,374,499 
1921      11,599,214 
1922      10,884,910 


Source: Statistical Abstract of the United States, 1922, Table 274. 


TABLE 73. RAW COTTON (EXCLUSIVE OF LINTERS) CONSUMED IN THE 
UNITED STATES 


By Months, January, 1914, to December, 1920 
(Unit: thousand bales) 


MONTH         1914          1915          1916          1917          1918          1919          1920 
January       [illegible]   467.9         542.1         601.4         523.9         556.9         591.9 
February      455.2         463.3         540.7         547.2         510.1         433.3         [illegible] 
March         403.4         [illegible]   613.8         603.9         571.4         433.5         575.8 
April         [illegible]   514.0         530.7         [illegible]   544.1         475.9         566.9 
May           400.7         [illegible]   575.6         615.4         575.0         [illegible]   [illegible] 
June          [illegible]   514.7         570.6         [illegible]   515.8         474.3         [illegible] 
July          448.3         496.8         489.5         537.8         541.5         [illegible]   [illegible] 
August        383.7         404.4         557.8         569.5         535.0         407.3         [illegible] 
September     414.9         498.7         528.3         522.4         490.0         491.1         458.0 
October       451.0         500.8         550.7         584.9         440.4         556.0         401.3 
November      420.7         514.7         583.0         590.4         [illegible]   491.3         332.7 
December      450.9         555.0         536.7         516.5         472.9         [illegible]   295.3 


Source: Review of Economic Statistics, prel. vol. V, January, 1923, p. 58. 


TABLE 74. RATIO OF COIN AND BULLION TO THE SUM OF TOTAL DEPOSITS AND 
NOTE CIRCULATION OF THE BANK OF ENGLAND 


By Weeks, January 1 to June 25, 1919 


DATE        PER CENT        DATE        PER CENT 
Jan.  1     [illegible]     Apr.  2     36.7 
      8     33.8                  9     38.4 
     15     36.2                 16     [illegible] 
     22     36.0                 23     39.0 
     29     37.1                 30     38.3 
Feb.  5     [illegible]     May   7     39.8 
     12     37.4                 14     40.6 
     19     37.4                 21     40.1 
     26     37.9                 28     39.2 
Mar.  5     36.0            June  4     38.3 
     12     37.2                 11     38.0 
     19     37.7                 18     39.5 
     26     38.1                 25     [illegible] 


Source: Review of Economic Statistics, Monthly Supplement, November, 1919, p. 11. 

The first two of these illustrative cases will be recognized as 
simple temporal distributions. Distributions of this sort are 
clearly time series; in fact, they constitute one of the most 
important varieties of such series. They are distinguished by the 
fact that the individual items are additive, and may be simply 
summed to form series of analogous items relating to longer 
intervals of time. 
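
The additive property of a temporal distribution may be illustrated by a short sketch. The monthly figures used here are hypothetical, chosen only for illustration, and the function name is an assumption introduced for the example rather than anything drawn from the text.

```python
def sum_to_longer_intervals(items, group_size):
    """Sum consecutive items of a temporal distribution into longer intervals."""
    if len(items) % group_size != 0:
        raise ValueError("number of items must be a multiple of group_size")
    return [sum(items[i:i + group_size])
            for i in range(0, len(items), group_size)]


# Hypothetical monthly totals (thousand units) for a single year.
monthly = [40, 38, 45, 50, 52, 48, 47, 44, 46, 51, 49, 42]

quarterly = sum_to_longer_intervals(monthly, 3)   # four quarterly items
annual = sum_to_longer_intervals(monthly, 12)     # one annual item

print(quarterly)
print(annual)
```

The same operation would be meaningless for a series of ratios or of daily price quotations, since such items are not aggregates and cannot be summed into an item for a longer interval.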

But not all time series are temporal distributions. Some of 
the most significant time series are composed of observed measure- 
ments or registrations of a variable element in a single object at 
different times. Thus, a time series may report the velocity of 
the wind at a weather observatory at noon on indicated days. 
Another series may state the opening daily quotations (or prices) 
of some industrial stock on the New York Stock Exchange over 
a period of six months. Still another may express averages, or 
relatives, or ratios, of items more directly observed. Thus, 
the individual items of a series designed to show the course of a 
copper stock on the Boston Exchange may be expressed in terms 
of weekly averages of the high and low quotations; or a series 
of price relatives may give the price of cotton for the middle of 
each month as a percentage of the average monthly price for the 
year 1913. It is clear, then, that time series assume a number 
of forms. Some result from direct measurement; others from 
tabulation of individual cases; still others from mathematical 
calculations of one sort or another. All the forms are alike, 
however, in expressing the magnitudes of a variable as a function 
of time. 
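
A series of price relatives of the kind just mentioned may be sketched as follows. The prices are hypothetical, and the function and variable names are assumptions made for the illustration; the calculation simply expresses each later price as a percentage of the average price in a chosen base period.

```python
def price_relatives(prices, base_period_prices):
    """Express each price as a percentage of the average base-period price."""
    base = sum(base_period_prices) / len(base_period_prices)
    return [100.0 * price / base for price in prices]


# Hypothetical prices (cents per pound); the base period averages 12.5.
base_period = [12.0, 12.5, 12.5, 13.0]
later_prices = [12.5, 15.0, 10.0]

relatives = price_relatives(later_prices, base_period)
print(relatives)
```

Items derived in this way are the result of calculation rather than of direct measurement or tabulation, yet they still express the magnitude of a variable as a function of time.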

The steps involved in the development of a time series have 
already been considered in connection with the formulation of the 
temporal distribution. No new principles governing class inter- 
vals, class limits, or class designations appear in the formulation 
of time series not distributive in character. True, if the items 
are not aggregates, inequality of intervals is not as bothersome 
in the analysis of the variable as it is in the case of temporal 
distributions. Thus, we may satisfactorily represent price data 
which are reported at dates irregularly spaced. There is a strong 
presumption, however, in favor of equal class intervals whether 
or not the series be aggregative in character. In brief, it may be 
said that the rules already developed¹ for the formulation of 
temporal distributions hold in general for the development of time 
series. 

The same is true of the rules for graphic representation of 
time series. As in the case of temporal distributions, intervals 
of time are to be laid off horizontally from left to right; units of 
the dependent variable — whatever it may be — vertically from 
bottom to top. Perhaps the argument in favor of the line chart, 
rather than the bar diagram, is stronger in the case of time series 
which are not distributive in character than in the case of those 
which are; since, presumably, the dependent variable in such 
series usually has values between those actually plotted. In 
general, however, the same system of plotting and of graphic 
representation is applicable to all varieties of time series. 


CHART 53. BRADSTREET’S MONTHLY INDEX OF WHOLESALE PRICES IN THE 
UNITED STATES, 1911-1925 

Source: Numerous issues of Bradstreet’s. 


¹ See Chapter VIII. 

This system has already been explained at length. The general 
principles to be observed will be sufficiently indicated at this 
point if a single illustration is given in standard form. This is 
afforded in Chart 53. 

Analysis of time series may well begin with the recognition of a 
number of distinct types of movement. These sometimes appear 
separately in different series, sometimes merged in a single 
series. It will simplify the description of the types if they are 
first considered in isolation. 

In general, movements in time series are to be classified on 
the basis of (a) the length of the period in which the movement 
completes its course, (b) the characteristic configuration of the 
movement, and (c) the regularity with which the movement 
appears. With reference to these points, movements may be 
described as 

1. Evolutionary movements, or long-time trends, showing 
   persistent rise or fall over a longer period of time. 

2. Periodic movements showing a definite recurrence of fluctuation 
   in given periods, or at stated intervals. 

3. Undulatory or cyclical movements, wave-like in character 
   but lacking clear periodicity. 

4. Episodic movements due to specific causes, ordinarily reflected 
   in sharp, pronounced breaks in the record of the 
   variable. (These exhibit no apparent tendency toward 
   recurrence at stated intervals.) 

5. Fortuitous or accidental movements, of unknown origin, 
   quite irregular in character, but involving only minor 
   disturbances of the general course of the variable. 


These types of movement call for close examination. They will 
first be more fully described and then in subsequent chapters 
analyzed at length. 

Evolutionary movements or secular trends are encountered in 
the records of a great many economic and business factors. 
This type of movement constitutes the long-time swing or drift 
of the variable. The movement is readily illustrated in the 
following concrete examples: 

TABLE 75. POPULATION OF THE UNITED STATES (EXCLUSIVE 
OF OUTLYING POSSESSIONS), 1790-1920 


CENSUS YEAR      POPULATION 
1790              3,929,214 
1800              5,308,483 
1810              7,239,881 
1820              9,638,453 
1830             12,866,020 
1840             17,069,453 
1850             23,191,876 
1860             31,443,321 
1870             38,558,371 
1880             50,155,783 
1890             62,947,714 
1900             75,994,575 
1910             91,972,266 
1920            105,710,620 


Source: Fourteenth Census of the United States, 1920, vol. I, p. 14. 


CHART 54. POPULATION OF THE UNITED STATES (EXCLUSIVE 
OF OUTLYING POSSESSIONS), 1790-1920 

TABLE 76. IMPORTS OF COFFEE INTO THE UNITED STATES, 
BY YEARS, 1910-1919 


(Unit: 1,000,000 lbs.) 


YEAR      COFFEE 
1910        871 
1911        875 
1912        885 
1913        863 
1914       1002 
1915       1119 
1916       1201 
1917       1320 
1918       1144 
1919*      1334 


* Calendar year, other years are fiscal. 


Source: Review of Economic Statistics, prel. vol. II, p. 132. 


CHART 55. IMPORTS OF COFFEE INTO THE UNITED STATES 
BY YEARS, 1910-1919 

Trends or long-time movements may be either upward or 
downward, denoting either increase or decrease, expansion or 
contraction, of the factor in question.¹ Obviously, trends may 
assume a great diversity of forms. The drift of the variable 
may be along a straight line, or along some simple, or some 
more complex, curve. The one essential feature of the trend is 
that it operate over a longer interval of time, and be not re- 
petitive in character. It is the underlying persistent movement 
of the variable. 

Periodic movements are another common type in economic and 
business series. One of the most significant periodic movements 
is seasonal variation. The nature of seasonal variation is evident 
from the designation itself: there is in the series a definite round 
of high and low points associated with the recurrence of the 
seasons. Thus, in many lines of business, mid-winter and 
mid-summer are both periods of relative quiet; spring and fall, 
both periods of marked activity. As an illustration of strong 
seasonal variation, observe the fluctuations of kilowatt hours 
sold for lighting by the Public Service Electric Company of New 
Jersey during the years 1910-1923. (See Chart 56.) Clearly 
the movement of this variable is markedly seasonal. 

Seasonal variation is really but one representative of a whole 
family of periodic movements. The course of a series over the 
hours of the day, or the days of the week, may show recurrent 
points of maxima and minima. Such points are to be found, for ex- 
ample, in the number of sales made in a retail store, in the number 
of passengers carried on a transit system, in the number of calls 
handled at a telephone exchange. The curve shown in Chart 57 
(see page 240) gives an interesting illustration. Examples of this 
sort of short-time periodic movement might be readily multiplied. 

¹ If the change is in the direction of increase or expansion, the movement is commonly 
referred to as the “growth element.” 

Cyclical movements are similar to the periodic in that they are 
definitely recurrent, but they are dissimilar in that they lack clear 
periodicity. Take, for example, the oscillations of the business 
cycle. (See Chart 79, page 307, for the plot of a curve of this 
type.) The fundamental difference between the longer recurrent 
movements of the cycle and the shorter ones of the seasons of 
the year, days of the week, or hours of the day is that the recurrence 
of the fluctuations does not rest upon as definite a physical 
basis. Some claim to find a definite eight- or eleven-year periodicity 
in the movements of crop yields in important agricultural 
countries. But in general it is fair to say that, though there is no 
difficulty in seeing why the manufacture of clothing should show 
seasonal variation, the reasons for the regular fluctuation of general 
business conditions are not as obvious. Certainly the causes 
are not as clearly physical in origin. It is not surprising, 
therefore, that the fluctuations should not be as regular in period. 
This lack of definite periodicity has important consequences for 
statistical analysis, as will appear later. It gives warrant for 
treating cyclical fluctuations in a somewhat different category 
from seasonal variations. 


CHART 56. KILOWATT HOURS SOLD FOR LIGHTING BY THE PUBLIC SERVICE 
ELECTRIC COMPANY, 1910-1923 

(Logarithmic Scale.) 

Source: H. B. Vanderblue and W. L. Crum, “The Relation of a Public Utility to the 
Business Cycle,” Harvard Business Review, vol. III, no. 1, p. 9. 

Episodic movements have, by nature, no characteristic form. 
They ordinarily have their origin in some specific cause which 
may operate to give the movement of the variable temporarily 
almost any form. A strike may cut the activity of a trade to a 
bare minimum or bring about a complete cessation of operations. 
The example given below (Table 77 and Chart 57) shows the 
course of a variable interrupted by an industrial dispute. 


TABLE 77. CALLS ON TELEPHONE EXCHANGE DURING MONTH OF JUNE, 19-- 


               SUNDAY    MONDAY    TUESDAY   WEDNESDAY  THURSDAY  FRIDAY    SATURDAY 

First Week                                                        179,974   161,133 
Second Week    83,216    184,749   178,898   171,516    186,425   184,209   161,793 
Third Week     83,278    184,423   183,657   184,505    186,819   194,034   166,618 
Fourth Week    84,222    161,752   189,333   186,743    192,086   195,115   171,774 
Fifth Week     77,772    179,437   118,265   119,299    124,860   131,465   116,439 


CHART 57. CALLS ON TELEPHONE EXCHANGE DURING MONTH OF JUNE, 19-- 
(Vertical scale: thousand calls. Horizontal scale: days of June, 1 to 30, beginning on a Friday.) 




In the chart the drop of the curve from June 26-30 indicates the 
extent to which the volume of business was affected by a strike 
of the telephone operators. Movements of this sort can hardly 
be expressed in any definite mathematical functions. They 
constitute unique cases which are to be examined with special 
reference to the conditions prevailing in the particular instance. 

Fortuitous or accidental movements are the result of irregular 
disturbances having their origin in a multitude of small acciden- 
tally distributed influences which lie individually beyond observa- 
tion. Take, for instance, the following record of rainfall during 
May for the period of 46 years from 1871 to 1916. The data 
are given in both Table 78 and Chart 58. 


TABLE 78. PRECIPITATION FOR THE MONTH OF MAY IN EACH YEAR, 1871-1916, 
NASHVILLE, TENN. 


YEAR    INCHES         YEAR    INCHES         YEAR    INCHES 
1871    3.30           1886    2.10           1901    [illegible] 
1872    3.09           1887    3.43           1902    4.36 
1873    6.31           1888    2.97           1903    5.64 
1874    1.49           1889    5.00           1904    2.07 
1875    [illegible]    1890    4.16           1905    6.00 
1876    3.40           1891    2.39           1906    3.80 
1877    1.25           1892    4.03           1907    6.01 
1878    [illegible]    1893    [illegible]    1908    2.80 
1879    2.88           1894    [illegible]    1909    4.57 
1880    4.13           1895    2.05           1910    5.81 
1881    3.67           1896    4.05           1911    1.67 
1882    7.38           1897    [illegible]    1912    4.02 
1883    4.82           1898    1.80           1913    2.66 
1884    3.58           1899    3.36           1914    3.01 
1885    4.36           1900    1.86           1915    [illegible] 
                                              1916    [illegible] 


Source: Tennessee State Geological Survey, The Resources of Tennessee, vol. VIII, 
no. 1, table X, pp. 40-41. 


Clearly the fluctuations of the variable here are quite erratic. 
Movements of this sort are present in a great many time series, 
though generally, because merged with movements of other 
kinds, not in as clear form as in this illustrative case. 

CHART 58. PRECIPITATION FOR THE MONTH OF MAY IN EACH YEAR, 1871-1916, 
NASHVILLE, TENN. 


The several varieties of movement just described rarely occur 
singly. Commonly two, three, or four varieties of movement are 
merged in a single series. Consider the illustration given in 
Table 79 and Chart 59. (See pages 243 and 244.) The general 
upward drift of the curve is unmistakable. There is, in other 
words, a strong positive trend. There is, furthermore, a definite 
tendency for high points to occur in the spring and fall of each 
year, indicating a seasonal variation. Again, there is a tendency 
for recurrent peaks and troughs over the years, suggesting a 
cyclical fluctuation under the influences of general business 
conditions. The episodic movements of early 1918 and late 1919 are 
unmistakable: the railroad congestion in the winter of 1918 
and the steel strike of the fall of 1919 account for these. Small 
irregular fluctuations, having no assigned cause, appear throughout 
the series. The series as a whole thus exhibits a combination 
of a wide variety of movements. 

The task of statistical analysis when confronted with such a 
series is twofold: (1) the different varieties of movement must be 
isolated, that is, separated from one another so that each may be 
dealt with by itself, and the contribution of each to the total 
result in this fashion ascertained; and (2) the character of the 
individual movements must be carefully determined.¹ This 
requires skillful adaptation of the course of analysis to the nature 
of the movement with which the analysis is concerned. The 
two phases of investigation must be kept clearly in mind, though 
they are so closely related that analysis will commonly deal with 
both at the same time. 
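
The idea of isolating one variety of movement from another may be made concrete with a small sketch. It is an illustration only: the series is synthetic, built from a hypothetical straight-line trend and a hypothetical seasonal pattern, and the centered twelve-month moving average used to recover the trend is one common device brought in for the example, not a method described in this chapter. All names are assumptions.

```python
def centered_moving_average(series, period=12):
    """Centered moving average for an even period.

    The two end items of each window receive half weight, so that the
    average is centered on an actual item of the series.
    """
    half = period // 2
    averages = []
    for t in range(half, len(series) - half):
        window = series[t - half:t + half + 1]  # period + 1 items
        total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        averages.append(total / period)
    return averages


# Hypothetical components: a linear trend and a seasonal pattern whose
# twelve monthly items sum to zero.
seasonal = [-3, -2, 0, 2, 3, 1, -1, -2, 0, 2, 1, -1]
trend = [100 + 2 * t for t in range(36)]
series = [trend[t] + seasonal[t % 12] for t in range(36)]

estimated_trend = centered_moving_average(series)

# estimated_trend[i] corresponds to month i + 6 of the original series.
for i in (0, 1, 2):
    print(i + 6, series[i + 6], estimated_trend[i])
```

In this artificial case the average removes the seasonal element entirely, because the seasonal items cancel over any twelve consecutive months. Actual series, in which cyclical, episodic, and fortuitous movements are also present, yield no such clean separation.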

The several varieties of movement having been described, 
each may now be subjected to special study. The determination 
of trends will first be considered. This will be followed by an 
examination of movements of a periodic character. Finally, a 
study will be made of the residual fluctuations, or of the deviations 
from the norms established in the preceding analysis. Before 
undertaking this line of study, however, it will be well to con- 
sider another aspect of time series, namely, increments and rates 
of change. These features of the time series will constitute the 
subject matter of the next chapter. 


¹ A time series does not ordinarily invite the derivation of a typical average since the 
variable exhibits no “central tendency” for the period as a whole. 


CHAPTER XVI 
INCREMENTS AND RATES OF CHANGE 


One of the most important questions that can be raised about 
a time series is: What is the magnitude of the changes exhibited 
by the variable from one period to another? In answering this 
question the changes may be regarded either absolutely or rela- 
tively. Absolute differences may be referred to as increments 
of change; relative differences as rates of change. The signifi- 
cance of these elements of change in a time series will be evident 
if a concrete illustration is examined. 

The population of the United States (exclusive of the outlying 
possessions) has been reported at successive decennial censuses 
from 1790 to 1920. The data are given in Table 80, column (a): 


TABLE 80. POPULATION GROWTH OF THE UNITED STATES (EXCLUSIVE 
OF OUTLYING POSSESSIONS),* 1790-1920 


                                   INCREASE OVER PRECEDING CENSUS 
CENSUS YEAR     POPULATION         ABSOLUTE          PER CENT 
                (in millions)      (in millions) 
                (a)                (b)               (c) 

1790              3.9                --                -- 
1800              5.3                1.4              35.1 
1810              7.2                1.9              36.4 
1820              9.6                2.4              33.1 
1830             12.9                3.2              33.5 
1840             17.1                4.2              32.7 
1850             23.2                6.1              35.9 
1860             31.4                8.3              35.6 
1870             39.8                8.4              26.6 
1880             50.2               10.3              26.0 
1890             62.9               12.8              25.5 
1900             76.0               13.0              20.7 
1910             92.0               16.0              21.0 
1920            105.7               13.7              14.9 


* Corrected figure for 1870. 


Source: “Increase of Population in the U. S., 1910-1920,” Census Monograph I, p. 21. 

In columns (b) and (c) of this table are given the absolute and 
relative increases of each count over the preceding count. The 
absolute increases appear in millions of population; the relative 
increases, in percentages. The table as a whole affords a complete 
picture of the population growth of the country during a period of 
one hundred and thirty years. 

Undoubtedly, if an analysis is being made of the growth of the 
population of the United States, interest in some connections will 
run largely to the aggregate population reported at the ten-year 
intervals. This aggregate is the original or primary variable; 
the increments and rates of change are derived or secondary. 
Effective presentation of the primary variable as originally given 
commonly constitutes one of the most important phases of the 
analysis. 

At the same time, in other connections it may be just as 
important to know what have been the absolute increases or 
decreases in the primary variable from one period to another. 
Thus, in a population study, the actual number of inhabitants 
added from one census to another may be of great significance 
in connection with the calculation of the added productive power, 
or the new military strength, of the nation. The absolute 
increases (or decreases) of the original variable are readily found, 
of course, by subtracting each term of the original series from the 
term which follows. The items thus secured are commonly called 
“first differences.” They constitute an important element in 
the analysis of the variable. 
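
The derivation of first differences may be sketched briefly. The figures are the coffee imports of Table 76 (in millions of pounds), as read from that table; the function name is an assumption made for the illustration.

```python
def first_differences(series):
    """Subtract each term of the series from the term which follows."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


# Imports of coffee, 1910-1919, in millions of pounds (Table 76).
coffee_imports = [871, 875, 885, 863, 1002, 1119, 1201, 1320, 1144, 1334]

increments = first_differences(coffee_imports)
print(increments)
```

A series of n items yields n - 1 first differences. Negative items in the result mark the periods in which the primary variable declined.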

Instead of considering the change of the variable absolutely, 
however, it may be important in still other analyses to relate the 
change in each period to the size of the variable at the beginning 
of the period. Referring again to the population study, it is 
obvious enough that the increase of 1,400,000 in the decade from 
1790 to 1800 is in many ways more striking than the increase of 
13,000,000 in the decade from 1890 to 1900; the smaller absolute 
increase of the earlier decade represents relatively a much more 
rapid growth of the population of the country. In some ways 
column (c) of Table 80 is the most significant and striking column 
of all; it discloses unmistakably the declining rate of growth 
of population in later years. Relative changes in a time series 
are not infrequently the most significant feature of the series. 
Changes may sometimes be best given, therefore, not as absolute 
increments or decrements but as percentages or proportions of 
increase or decrease; in other words, as rates of change. 
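
The contrast between absolute and relative change in the two decades mentioned above may be worked out in a short sketch. The census counts are those of Table 75; the function name is an assumption made for the illustration.

```python
def rate_of_change(earlier, later):
    """Change over a period as a percentage of the value at its beginning."""
    return 100.0 * (later - earlier) / earlier


# Census counts from Table 75.
pop_1790, pop_1800 = 3_929_214, 5_308_483
pop_1890, pop_1900 = 62_947_714, 75_994_575

early_increment = pop_1800 - pop_1790
late_increment = pop_1900 - pop_1890

early_rate = rate_of_change(pop_1790, pop_1800)
late_rate = rate_of_change(pop_1890, pop_1900)

print(early_increment, round(early_rate, 1))
print(late_increment, round(late_rate, 1))
```

The later increment is more than nine times the earlier one, yet the later rate of change is much the smaller of the two, which is the point made in the text.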

If, in the analysis of a time series, attention is to be focused 
upon any one of the three features just mentioned, to the exclusion 
of the others, it is best to give directly and separately the appro- 
priate variable element, whether it be the primary variable, 
the derived increment of change, or the derived rate of change. 
Thus, to take as an illustration the last of these three, it is obvious 
that nothing could show more clearly than Chart 60 the fall in 
the rate of growth of the population of the United States during 
more recent years. If interest runs solely or very largely to this, 
rather than to any other, feature of the primary variable, the 
simple graphing of this single factor is preferable to any other 
form of presentation. 


CHART 60. PERCENTAGE INCREASE OF POPULATION OF THE UNITED 
STATES DURING INTERCENSAL PERIODS, 1790-1920 
(Vertical scale: per cent.) 


At times, however, interest is not confined to a single element 
of variation. It may then be desirable to display in a single 
chart at least two features of the variable. Suppose it is desir- 
able to show an aggregate, and the increments of change in the 
aggregate, for successive periods. This object may be attained 
in a simple plot of a time curve on natural scale. An illustration 
is given in Chart 61. It is relatively easy to obtain from such a 
plotting a clear notion of the course of the variable, as well as of 
the absolute changes of the variable from one reporting period to 
another. 


CHART 61. POPULATION OF THE UNITED STATES, 1790-1920 
(Vertical scale: millions. The increment for 1880-90 is marked on the chart.) 

Suppose, on the other hand, that it is desirable to show in a 
single diagram the course of the original variable and the rate of 
its change from one period to another. Satisfactory results are 
not to be had in this case from a chart drawn to natural scale. 
It is quite impossible to translate into relative terms by direct 
visual reading the absolute changes shown in the chart.¹ To 
present effectively in a single curve the course of a variable and 
the rates of its movement, it is necessary to resort to the use of the 
logarithmic or ratio scale.¹ 

¹ For a geometric device for measuring relative change in a curve plotted to natural scale 
see A. Marshall, “On the Graphic Method of Statistics,” Jubilee Volume, Royal Statistical 
Society, 1885, pp. 251-260. 


CHART 62. POPULATION OF THE UNITED STATES (EXCLUSIVE OF 
OUTLYING POSSESSIONS), 1790-1920 
(Vertical scale: millions, logarithmic.) 

In a chart drawn to logarithmic scale, equal vertical distances 
correspond to equal proportional changes. The population 
growth from 1790 to 1920 is drawn to logarithmic scale in Chart 
62. Here the reading on the vertical scale of the total population 
at any date is comparatively simple. At the same time the 
variable slope of the curve gives a direct visual indication of the 
changes that have taken place in the rate of population growth. 


TABLE 81. ANNUAL PRODUCTION OF CIGARETTES IN THE UNITED STATES 
BY YEARS, 1899-1919 


YEAR      PRODUCTION      LOGARITHMS OF PRODUCTION 
1899         3.74              0.5729 
1900         3.25              0.5119 
1901         2.72              0.4346 
1902         2.96              0.4713 
1903         3.36              0.5263 
1904         3.43              0.5353 
1905         3.67              0.5647 
1906         4.50              0.6532 
1907         5.26              0.7210 
1908         5.74              0.7589 
1909         6.82              0.8338 
1910         8.64              0.9365 
1911        10.47              1.0199 
1912        13.17              1.1196 
1913        15.56              1.1920 
1914        16.86              1.2269 
1915        17.96              1.2543 
1916        25.29              1.4029 
1917        35.33              1.5481 
1918        37.89              1.5785 
1919        44.77              1.6510 


Source: Review of Economic Statistics, Nov., 1920, Table XXI, p. 320. 


¹ On the subject of logarithmic curves, two articles will be found particularly helpful: 
(1) J. A. Field, “Some Advantages of the Logarithmic Scale in Statistical Diagrams,” 
Journal of Political Economy, October, 1917, pp. 806-841; (2) Irving Fisher, “The Ratio 
Chart for Plotting Statistics,” Quarterly Publications of the American Statistical Association, 
June, 1917, pp. 577-601. 

A logarithmic plot of a time series may be made in either of two 
ways. In the first place, the logarithms of the values of the 
variable may be plotted to natural scale. This method is illus- 
trated in Chart 63 on the data of Table 81. 
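
The first method requires only the common logarithm of each item. A minimal sketch, using four of the production figures of Table 81, shows how the logarithms in that table may be obtained; the variable names are assumptions made for the illustration.

```python
import math

# Selected items of Table 81: year and annual production of cigarettes.
production = {1899: 3.74, 1900: 3.25, 1918: 37.89, 1919: 44.77}

# Common (base-10) logarithms, rounded to four places as in the table.
logarithms = {year: round(math.log10(value), 4)
              for year, value in production.items()}

for year in sorted(logarithms):
    print(year, production[year], logarithms[year])
```

The logarithms so obtained are then plotted against time on an ordinary natural scale.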


CHART 63. ANNUAL PRODUCTION OF CIGARETTES IN THE UNITED STATES, 

The difficulty with this kind of chart is that the logarithms 
corresponding to values of the variable other than those repre- 
sented in the plot are not evident from the chart. Furthermore, 
it is bothersome to have to ascertain the logarithms of the indi- 
vidual items of the series. 

Instead of plotting the logarithms of the numbers to natural 
scale, it is better ordinarily to plot the actual numbers to logarith- 
mic scale (see Chart 62). Such a logarithmic scale is readily 
obtained by marking off on the vertical axis of a coordinate field 
a set of numbers at distances corresponding to their logarithms. 
The latter may be secured, of course, from any table of logarithms 
(see Appendix F).¹ The grid is then made up by drawing in the 
horizontal lines with this logarithmic spacing, the vertical lines 
being evenly spaced as in other charts.¹ With such logarithmically-ruled 
paper in hand, a plot of any time series on logarithmic 
scale may be obtained in simple fashion. 

¹ For some purposes the logarithmic spacing may be obtained with sufficient accuracy 
by marking it off from the divisions on a common slide rule. 
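
The marking off of a logarithmic scale may be sketched as a small calculation. The function and its parameters are assumptions made for the illustration: given the numbers at the bottom and top of the scale and the physical height of the scale, it returns the distance above the bottom at which any value is to be marked.

```python
import math


def log_scale_position(value, bottom, top, height):
    """Distance above the bottom of the scale at which `value` is marked."""
    span = math.log10(top) - math.log10(bottom)
    return height * (math.log10(value) - math.log10(bottom)) / span


# A one-cycle scale running from 10 to 100, drawn 10 units high.
for value in (10, 20, 30, 50, 100):
    print(value, round(log_scale_position(value, 10, 100, 10.0), 2))
```

On such a scale the interval from 10 to 20 takes up about three tenths of the height, while the interval from 50 to 100 takes up the same distance again, since both represent a doubling.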

A rule of standard graphic practice should be noted in connection 
with the preparation of a semi-logarithmic grid: a power 
of ten should appear at the bottom of the vertical scale.² The 
scale may cover the range of only one power of ten (one “cycle”), 
say, from 10¹ to 10² (i.e. from 10 to 100); or it may run through 
two cycles, say, from 10¹ to 10³ (i.e. from 10 to 1000); or even 
through four cycles, as from 10¹ to 10⁵ (i.e. from 10 to 100,000). 
The scale is said to be in as many cycles as there are powers of 
ten covered between the bottom and top of the vertical scale. 

As already indicated, in a chart drawn to logarithmic scale 
equal vertical distances correspond to equal ratios or equal 
proportional changes.³ Thus the difference between 2 and 4 on 
logarithmic scale is equivalent to the difference between 4 and 8, 
or 8 and 16. Equal vertical differences represent multiplication 
of successive values by a constant factor. In the case of the 
population growth the difference vertically between 3,000,000 
and 6,000,000 is equivalent to the difference between 50,000,000 
and 100,000,000. 
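
The equality of these vertical distances follows from the fact that the logarithm of a ratio is the difference of the logarithms. A brief check, with names assumed for the illustration:

```python
import math


def log_distance(lower, upper):
    """Vertical distance between two values on a logarithmic scale,
    measured in units of one cycle (one power of ten)."""
    return math.log10(upper) - math.log10(lower)


doublings = [log_distance(2, 4), log_distance(4, 8), log_distance(8, 16)]
population = [log_distance(3_000_000, 6_000_000),
              log_distance(50_000_000, 100_000_000)]

print(doublings)
print(population)
```

Every one of these distances is the logarithm of 2, about three tenths of a cycle, whatever the absolute size of the figures compared.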

It follows that a straight line on logarithmic scale represents 
change at a constant rate. If a curve on a logarithmic graph is 
concave to the X-axis, the rate of increase is declining (or the 
rate of decrease mounting). If the curve is convex to the X-axis, 
the rate of increase is mounting (or the rate of decrease declining). 
Parallelism in lines on a logarithmic plot reflects change at the 
same rate in the variables represented by the lines. Curves on 
logarithmic scale thus disclose in simple fashion the more significant 
features of the rates of increase or decrease of plotted variables. 
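
These properties may be verified numerically. In the sketch below, which uses hypothetical figures and assumed names, the successive vertical steps of a curve on logarithmic scale are computed as differences of logarithms: equal steps mean a straight line, and shrinking steps mean a curve concave to the X-axis.

```python
import math


def log_steps(series):
    """Successive vertical steps of a series plotted on logarithmic scale."""
    logs = [math.log10(value) for value in series]
    return [later - earlier for earlier, later in zip(logs, logs[1:])]


# Growth at a constant rate of 5 per cent per period (hypothetical).
constant_rate = [100 * 1.05 ** t for t in range(6)]

# Growth at declining rates of 50, 30, and 15 per cent (hypothetical).
declining_rate = [100, 150, 195, 224.25]

constant_steps = log_steps(constant_rate)
declining_steps = log_steps(declining_rate)

print(constant_steps)
print(declining_steps)
```

The first series gives steps all equal to the logarithm of 1.05, hence a straight line; the second gives steps that grow smaller, hence a curve that rises ever more gently although the variable continues to increase.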

¹ Paper ruled in this fashion is commonly referred to as semi-log or arith-log paper to 
distinguish it from paper on which both horizontal and vertical lines are spaced logarithmically, 
commonly called “log paper.” Both log and semi-log (or arith-log) papers are 
prepared for sale by a number of makers of special drafting forms. 
² It is common practice to have the upper terminal of the vertical scale also some power of 
ten. 
³ For this reason a plot to logarithmic scale may be appropriately called a “ratio chart.” 

A clear idea of the meaning of a logarithmic plot may be had 
from a direct comparison of the same variable charted on natural 
and logarithmic scales. Such a direct comparison is shown in the 
two figures of Chart 64. 


CHART 64. GROWTH OF INVESTMENT AT CONSTANT RATE OF RETURN, 1870-1910, 
SHOWN ON ARITHMETIC AND LOGARITHMIC SCALES 
(Panel A: arithmetic scale. Panel B: logarithmic scale. Horizontal scale in both panels: years, 1870 to 1910.) 

Source: John Wenzel, “Graphic Charts,” Scientific American, Supplement No. 2154, 
p. 236. 

It should be noted that the exact inclination of a curve on 
logarithmic scale depends upon the arbitrary relationship be- 
tween the horizontal and vertical scales adopted in the construc- 
tion of the chart, as well as upon the rate of change of the variable 
represented. Furthermore, the angles made with the horizontal 
axis by different straight lines on logarithmic plot are not to be 
thought of as proportionate to the rates of change represented 
by the different lines. In Chart 65 the inclination of lines repre- 
senting different rates of growth is shown. The exact rate of 
change represented by a line on a logarithmic plot is to be determined 
only by careful examination of the figure.¹ 
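
Both cautions may be illustrated by a short calculation. The sketch rests on assumptions made for the example: the rise of a line per period, in cycles, is taken as the common logarithm of one plus the rate, and the two scale parameters stand for the arbitrary physical lengths given to one cycle and to one period on the chart.

```python
import math


def rise_per_period(rate):
    """Vertical rise per period, in cycles, for growth at a constant rate."""
    return math.log10(1 + rate)


def angle_degrees(rate, length_of_cycle, length_of_period):
    """Angle with the horizontal axis, given the lengths adopted on the
    chart for one cycle (vertical) and one period (horizontal)."""
    rise = rise_per_period(rate) * length_of_cycle
    return math.degrees(math.atan2(rise, length_of_period))


# Doubling the rate does not double the rise, still less the angle.
print(rise_per_period(0.05), rise_per_period(0.10))
print(angle_degrees(0.05, 10, 1), angle_degrees(0.10, 10, 1))

# The same rate gives a different angle when the scales are changed.
print(angle_degrees(0.10, 10, 1), angle_degrees(0.10, 5, 1))
```

For this reason the rate represented by a line is better read from the values on the vertical scale at two dates than judged from the steepness of the line.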


¹ It is helpful sometimes to introduce in a logarithmic plot a guide line showing constant 
proportionate increase at some designated rate. 

The most important advantage of the logarithmic plot is that 
it represents in simple graphic form the facts of relative change in 
the variable, while leaving clearly in the picture the original 
values of the variable itself. This is a very considerable advantage 
since proportionate rather than absolute changes are to 
be emphasized in a great many analyses. The curves shown in 
Chart 66¹ afford an illustration. The chart serves to bring out 
the fact that relative changes from month to month are much the 
same in the three years, though the absolute changes are quite 
different. The use of the vertical logarithmic scale is to be especially 
recommended when the differences of size in the items of the 
variable are great, so that a diagram drawn to natural scale 
seriously distorts impressions of the extent of the relative changes 
of the variable. 

¹ The data, for which I am indebted to Mr. R. B. Prescott, are given in the table appearing 
below the chart. 


CHART 65. LINES SHOWING ON LOGARITHMIC SCALE INVESTMENTS INCREASING 
AT CONSTANT RATES 
(Vertical scale: amount invested. The lines are labeled by rate of increase per annum.) 


CHART 66. MONTHLY RETAIL SALES OF AUTOMOBILES, JANUARY, 1922 
TO DECEMBER, 1924 
(Vertical scale: thousands. Horizontal scale: months, January to December.) 


TABLE 82. MONTHLY RETAIL SALES OF AUTOMOBILES, JANUARY, 1922
TO DECEMBER, 1924

                     NUMBER OF CARS SOLD
MONTH
                 1922        1923        1924
January         2,380       4,597       6,840
February        1,968       4,414       6,486
March           3,416       8,162      10,925
April           5,464      13,472      17,201
May             6,742      16,0604     16,504
June            6,286      15,308      12,629
July            6,826      14,057      14,407
August          5,072      12,704      12,116
September       5,388       9,504       7,541
October         5,051       0,579       7,700
November        2,505       5,692       6,383
December        1,899       3,645       4,230

Another advantage of the logarithmic plot is to be noted: the
logarithmic scale has a much greater serviceable range than the
natural scale. Variables that differ greatly in magnitude may be
satisfactorily represented on logarithmic scale. Thus, on a
four-cycle logarithmic scale it is possible to get a clear plot of
values as small as, say, 15 to 20 and values as large as 85,000 to
95,000. This manifestly would be impossible on any natural
scale, since, if the scale were so drawn as to distinguish between
85,000 and 95,000, a difference as small as that between 15 and
20, or even 50 and 100, could not possibly be shown in the figure.
For the plotting, then, of variables which differ greatly in magnitude,
the logarithmic scale is particularly useful.
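A rough sketch of this comparison follows. It assumes, purely for illustration, a four-cycle scale running from 10 to 100,000 and a natural scale covering the same range; the function names are likewise hypothetical. Each gap is expressed as a fraction of the full height of the chart.

```python
import math

def log_position(value, low=10.0, cycles=4):
    """Fractional height (0 = bottom, 1 = top) of a value on a
    logarithmic scale running from `low` through `cycles` powers of ten."""
    return (math.log10(value) - math.log10(low)) / cycles

def natural_position(value, low=10.0, high=100_000.0):
    """Fractional height of the same value on a natural scale."""
    return (value - low) / (high - low)

for a, b in [(15, 20), (50, 100), (85_000, 95_000)]:
    gap_log = log_position(b) - log_position(a)
    gap_nat = natural_position(b) - natural_position(a)
    print(f"{a:>6} to {b:>6}: log scale {gap_log:.4f}, natural scale {gap_nat:.6f}")
```

On the logarithmic scale all three gaps occupy a visible share of the chart, whereas on the natural scale the two smaller gaps shrink to a minute fraction of its height.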

The principal difficulty in the use of the logarithmic plot lies 
in the fact that many persons are unfamiliar with logarithms. 
Readers become confused by the unevenly spaced horizontal 
lines which appear in a logarithmically-spaced grid. This 
difficulty can be offset in part by including the actual data along- 
side of the plots, and giving comparatively few of the horizontal 
scale-lines. If the figures of the vertical scale are shown only at 
wide intervals, the nature of the relationship between different 
points on the vertical scale appears to be more easily understood 
by those unfamiliar with logarithms. By such practices, the 
logarithmic scale may be made more generally acceptable, and 
its meaning gradually taught to a larger number of readers. 

In general, the logarithmic plot has very important virtues. 
Doubtless it should be used more than it has been. It is not to 
be employed universally, however, for there are times when 
increments of change are more important than rates of change. 
Where rates of change are of prime interest the desirability 
of using the logarithmic plot should always be carefully con- 
sidered. 

In general, the examination and presentation of changes in a 
variable, either in absolute or relative form, frequently con- 
stitutes an important phase of statistical analysis. The several 
possible lines of treatment described in this chapter are all to be 
regarded as effective means available for this purpose. 


CHAPTER XVII 


EVOLUTIONARY MOVEMENTS: SECULAR TREND 


It was pointed out in Chapter XV that time series commonly
display several varieties of movement. These were broadly 
divided into five categories: evolutionary movements; periodic 
movements; cyclical movements; episodic movements; and 
accidental movements. The present chapter deals with the first 
of these. 

Evolutionary movements are commonly referred to as trends. 
They possess the general characteristic of persistent movement 
in a given direction, this movement having the effect of changing 
the level upon which the shorter movements of the variable 
occur. Trends constitute one of the most important types of 
movement in time series. Their nature, and the methods by 
which they are to be measured, must be carefully considered. 

It will be helpful at the outset to cite a number of concrete 
illustrations of evolutionary movements. The dash lines in 
the figures of Chart 67 are the trends of the different series repre- 
sented. It is clear that these trends assume different forms and 
represent both upward and downward tendencies. One general 
notion is represented in each case, however: the notion of a 
persistent movement of the variable over a longer period of time.2

In the study of trends, the first step should ordinarily be the
plotting of the series in full. As a rule, it is not necessary to
plot the more detailed data: annual averages or aggregates, or at
most, corresponding monthly figures, will suffice. More detailed
data running to shorter periods like weeks and days merely complicate
the diagram and make the visualization of the trend somewhat
more difficult. Ordinarily a natural scale will serve best in
this preliminary plotting of the variable, but if there are indications
that the variable has shown a steady rate of increase or
decrease, a logarithmic scale may be used instead. The important
thing is to obtain a clear impression of the general course
of the variable over the period in question.


1 In this connection see W. I. King, “Principles Underlying the Isolation of Cycles and
Trends,” Journal of the American Statistical Association, December, 1924, pp. 468-475.

2 What constitutes a “longer period of time” has to be settled somewhat arbitrarily in
the light of the particular investigation in hand. If one is making an analysis of fluctuations
in the hourly service of a telephone exchange, differences extending over a period of
months may be regarded as relating to a “longer period of time.” In most cases, however,
trends relate to periods that run over at least several years, and in some investigations over
several decades.


CHART 67. TRENDS OF AGRICULTURAL PRODUCTION IN THE UNITED STATES,
1879-1919

[Figure: four panels (wheat, white potatoes, rice (rough), flaxseed); production in million bushels, 1879 to 1919; dash lines show the trends.]

Source: Edmund E. Day, “An Index of the Physical Volume of Production,” reprint
from Review of Economic Statistics, pp. 6-7.

The preliminary plot should indicate fairly definitely whether
or not the series exhibits any trend. If a trend appears to be
present, the next step is to determine as accurately as possible
the form of the movement. Three different processes are employed
for the purpose: (1) the freehand drawing of a line of
fit; (2) the method of moving averages (or “progressive means”);
(3) the method of mathematical curve-fitting. These methods
are entirely distinct and require separate consideration.

In securing a line of best fit by the method of freehand drawing, 
the problem is to obtain a smooth line which conforms to the 
general drift of the series and preserves equal areas above and 
below the line of trend — areas described by the line of trend 
and the plot of the original series — for the period for which the 
fit is made. The stretching of a thread or string across the plot
of the series facilitates the final location of the trend by this 
method if the trend may be assumed to be in the form of a straight 
line. Facility in the freehand drawing of lines of fit is fairly 
readily acquired. 

This method of graphic approximation is entirely satisfactory
for obtaining a general visual impression of trend. Objection to 
the method arises largely from the fact that it yields no exact 
objectively-obtained numerical equivalents for the trend-line. 
Of course, actual readings may be taken from the line as drawn, 
but these are at best rather crude and are by no means as satis- 
factory as the data obtained under the other two methods of de- 
termining the trend. In general, the method of freehand drawing 
is to be employed only for the purposes of a preliminary
reconnaissance of the data.

The method of the moving average is more important.1 A
moving average consists of a series of averages computed from a
given number of the items of a series, each successive average
depending upon a set of items in which an earlier item has been
dropped and a later item added.2 Thus, suppose we have the
series of monthly sugar meltings shown in column (a) of Table 83.
The five months’ moving average of this series is obtained by first
taking the average of the months January to May, 1919, next the
average of the months February to June, and so on through to the
average of the months from August to December, 1921, inclusive.
The moving sums and successive averages of the items of column
(a) appear in columns (b) and (c) respectively. The moving average
may be cited either against the month which stands at the
middle of the period for which it is computed, or against the month
which immediately follows the period for which it is computed;
or, as a matter of fact, in other ways, provided the precise relationship
between the moving average and the period for which it
is cited is definitely and unmistakably indicated. As a rule,
however, moving averages are centered: in other words, they
are quoted for the point in time which lies at the middle of the
period for which the averages are calculated.1


1 The method of the moving average is pertinent to any problem of smoothing the fluctuations
of a variable. The process of taking an average naturally eliminates many of the variations
which appear in the original items. The moving average is to be considered in the
present connection primarily as a means of smoothing series which exhibit a definite periodic,
or clear wavelike, movement.

2 Ordinarily the arithmetic mean is employed. Of course, some other mean (say, the
median) may be used if there are special reasons for so doing. In the discussion in the
text, use of the arithmetic mean will be assumed.

Computation of the moving average involves two steps:
(1) successive summation of the items from which the series of
averages is to be derived; (2) computation of the averages from
these “moving” sums through division by the number of items
included in each. In obtaining the moving sum, it is desirable
to use an adding or calculating machine which will permit the
derivation of the next sum through the elimination of the earliest
item of the previous group of items and the inclusion of the next
item to be added.2 The use of a calculating machine which subtracts
as well as adds directly makes this an easy process. With
the moving sums in hand, the calculation of the moving average
is a comparatively simple matter.3
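The two steps may be sketched as follows. The function name is an assumption made for illustration; the data are the twelve monthly items for 1920 from column (a) of Table 83. Each new moving sum is derived from the previous one by dropping the earliest item and adding the next, in the manner described for the calculating machine.

```python
def centered_moving_average(items, span):
    """Return (moving_sums, moving_averages) for an odd span.
    Entry i of each list is centered on items[i + span // 2]."""
    if span % 2 == 0 or span > len(items):
        raise ValueError("span must be odd and no longer than the series")
    running = sum(items[:span])          # step 1: the first moving sum
    sums = [running]
    for i in range(span, len(items)):
        # Drop the earliest item of the previous group, add the next item.
        running += items[i] - items[i - span]
        sums.append(running)
    averages = [s / span for s in sums]  # step 2: divide by the number of items
    return sums, averages

# Sugar meltings for the twelve months of 1920, column (a) of Table 83.
meltings_1920 = [181, 269, 333, 307, 286, 319, 325, 287, 164, 118, 179, 154]
sums, averages = centered_moving_average(meltings_1920, 5)
print(sums)      # the first entry is centered on March, 1920
print(averages)
```

The results may be compared with the entries of columns (b) and (c) of Table 83 for March to October, 1920.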

1 Since this is the common practice, it is somewhat simpler to compute moving averages
for periods which include an odd number of items. The moving average then takes the
place of the middle item of the set from which it has been computed. If the moving average,
on the other hand, has been computed from an even number of items, it has to be centered
against the point in time which lies halfway between the two middle items of the set from
which the average has been computed. While this slightly complicates the record, it is not
to be thought of as a serious objection to the derivation of moving averages from an even
number of items where other considerations seem to make this desirable.

2 The process of picking out the items to sum is facilitated by cutting a slot in cardboard
just long and wide enough to permit the reading of the proper number of items.

3 It somewhat simplifies the process, of course, to have the moving average an average of,
say, five or ten or some other number of items for which division of the sums is easy.


TABLE 83. SUGAR MELTED AT ATLANTIC PORTS, BY MONTHS,
JANUARY, 1919 TO DECEMBER, 1921

(Unit: 1000 long tons)

                      ORIGINAL       FIVE MONTHS'    FIVE MONTHS'
MONTH                 ITEM           MOVING SUM      MOVING AVERAGE
                      (a)            (b)             (c)
1919  January         147
      February        229
      March           261            1221            244.2
      April           277            1387            277.4
      May             307            1450            290.0
      June            313            1418            283.6
      July            292            1433            286.6
      August          229            1342            268.4
      September       292            1206            241.4
      October         216            1041            208.2
      November        [illegible]     993            198.6
      December        [illegible]     970            194.0
1920  January         181            1087            217.4
      February        269            1217            243.2
      March           333            1376            275.2
      April           307            1514            302.8
      May             286            1570            314.0
      June            319            1524            304.8
      July            325            1381            276.2
      August          287            1213            242.6
      September       164            1073            214.6
      October         118             902            180.4
      November        179             709            141.8
      December        154             738            147.6
1921  January          94             930            186.0
      February        193             983            196.6
      March           310            1065            213.0
      April           232            1188            237.6
      May             236            1216            243.2
      June            217            1220            244.0
      July            221            1180            236.0
      August          314            1138            227.6
      September       192            1124            224.8
      October         194            1081            216.2
      November        203
      December        178

Source: “Cyclical Fluctuations of the Volume of Manufacture,” Review of Economic
Statistics, January, 1923, p. 51.


Centered moving averages obviously do not extend the full
length of the original series. The first and last individual averages
are each half a span or wave-length away from the terminals
of the original series. When it is necessary to have the moving
average extend to either end of the original series, this result can
be obtained by freehand graphic extrapolation of the plot of the
moving average, or by calculation of the moving average for the
terminal periods on the assumption of a repetition of the first and
last items in the original series, or on the assumption of constant
increments (or decrements) or a constant rate of increase (or
decrease) beyond the points for which the moving average can
be directly calculated. None of these methods, however, is
altogether satisfactory, and the failure of the moving average to
cover the full period of the original series remains one of its more
serious defects.
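One of the devices just mentioned, the assumption of a repetition of the first and last items, may be sketched as follows. The function name is hypothetical and the short series (the first seven months of 1920 from Table 83) is used only for illustration. The terminal values obtained in this way rest on the stated assumption and are subject to the reservation made in the text.

```python
def padded_moving_average(items, span):
    """Centered moving average extended to both ends of the series by
    assuming a repetition of the first and last items."""
    if span % 2 == 0:
        raise ValueError("span must be odd")
    half = span // 2
    # Repeat the first item before the series and the last item after it.
    padded = [items[0]] * half + list(items) + [items[-1]] * half
    return [sum(padded[i:i + span]) / span for i in range(len(items))]

series = [181, 269, 333, 307, 286, 319, 325]   # January to July, 1920 (Table 83)
extended = padded_moving_average(series, 5)
print(extended)   # one average for every item of the original series
```

The interior averages are the same as those of the ordinary centered moving average; only the first two and last two entries depend on the repeated items.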

When a moving average is employed to determine the trend of a 
series which is exhibiting a definite wavelike fluctuation, the first 
step is to ascertain the characteristic wave-length. The wave- 
length is nothing more, of course, than the interval from the crest 
of one movement to the crest of the next; or, to put the matter 
more generally, from any particular phase of one movement to 
the same phase of the next movement. The wave-length can be 
ascertained most readily from a study of the plot of the original 
items. It is this wave-length which should determine the number
of items to be included in calculating each average. 

Unfortunately, wave-lengths appearing in time series are 
commonly somewhat irregular, if not rather indeterminate. 
This constitutes one of the serious objections to the use of the 
moving average in the measurement of trends. The difficulty 
is so fundamental and the process of obtaining the moving average 
so laborious, that in the majority of instances it is preferable to 
measure trend through the mathematical derivation of a line of
best fit rather than by means of a moving average. 

The mathematical determination of a line of trend involves two 
fundamental problems: (1) the choice of a definite form of line — 
such as the straight line, the compound interest curve, the para- 
bola, hyperbola, or some more complicated form; (2) the selec- 
tion of the period upon which this particular form of curve is to 
be fitted. The choice of the form of line will be considered first.
The subject as a whole is one which cannot well be dealt with
comprehensively in a text on elementary statistical analysis.
Some indication may be given, however, of the types of curve
encountered in statistical data.


CHART 68. AVERAGE DAILY RATE OF PRODUCTION OF PIG IRON
FOR ALL FURNACES IN THE UNITED STATES, 1898-1914

[Figure: vertical axis, thousand tons; horizontal axis, years 1898 to 1914.]

Trend: Straight Line

Equation: y = 56,768 + 2769 x     Origin = July 1, 1906     Unit of time = 1 year.

Source: O. W. Blackett, Statistical Notes on the Iron and Steel Industry (unpublished
thesis).


With a great many variables encountered in economic and 
business science, the assumption of a straight-line trend is ade- 
quate for the purposes of analysis, and is as exact an assumption 
as the crudities of the data will justify. Straight-line trends of 
growth imply expansion with a steady increment, in other words,
at a steadily declining rate; or contraction with a constant decrement,
that is, at a steadily increasing rate. Such conditions are
common among economic and business phenomena. The case 
of expansion shown in Chart 68 serves as an illustration. 
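That a steady increment implies a steadily declining rate may be verified with a short sketch. The constants are those of the straight-line equation as printed with Chart 68 (y = 56,768 + 2769 x, x in years from the origin); the function names and the choice of years are assumptions made for illustration.

```python
INTERCEPT = 56768.0   # trend value at the origin, as printed with Chart 68
SLOPE = 2769.0        # constant annual increment, as printed with Chart 68

def trend(x):
    """Ordinate of the straight-line trend, x in years from the origin."""
    return INTERCEPT + SLOPE * x

def proportionate_rate(x):
    """Annual increment expressed as a fraction of the trend value at x."""
    return SLOPE / trend(x)

years_from_origin = list(range(-8, 9, 4))
rates = [proportionate_rate(x) for x in years_from_origin]
for x, r in zip(years_from_origin, rates):
    print(f"x = {x:+d}: trend = {trend(x):,.0f}, rate of increase = {r:.2%}")
```

The increment is the same at every point, but since it is measured against an ever larger trend value the proportionate rate falls from year to year.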
Compound-interest curves are not so frequently encountered,
but may be exhibited in the expansion of an economic variable
over a considerable period. The production of cigarettes in the
United States, shown in Chart 69, conforms very closely to the
compound-interest curve for the period 1901 to 1915.


CHART 69. PRODUCTION OF CIGARETTES IN THE UNITED STATES, 1901-1915

[Figure: vertical axis, billions.]

Trend: Compound-Interest Curve

$y = (5.636)(1.158)^x$     Origin = 1907     1913 = 13.61

Source: Edmund E. Day, “An Index of the Physical Volume of Production,” reprint
from the Review of Economic Statistics.

Still another curve which is of particular interest in the study 
of economic factors is the so-called Gompertz curve. The 
general character of this curve is shown in Chart 70. The 
curve has the excellent quality of exhibiting an increasing
rate of growth for a time, with a declining rate of expansion in a
later phase. This corresponds, of course, to the development 
which is disclosed in the growth of a large number of economic 
factors. 

The choice of the form of line to be used in determining the
trend of a time series calls for exercise of the utmost care and
good judgment. Inspection of a plot of the series on natural
scale will ordinarily suggest fairly definitely some particular form
of curve.1 The implications of this curve should be carefully
checked, however, by supplementary studies of the variable in
question. Confirmatory data should be sought from every possible
source, and by one means and another the reasonableness
of the assumed form should be thoroughly verified before it is
adopted as a true representation of the trend of the variable.


CHART 70. FORD SALES BY MONTHS, 1912-1920

[Figure: vertical axis, number of cars (in thousands); horizontal axis, years 1912 to 1920.]

Trend: Gompertz Curve
$y = 3.10059 - 1.84842(.7584)^x$
Source: Data supplied by R. B. Prescott.


1 If the plot to natural scale appears to develop a compound-interest curve, it may pay to
replot the data on logarithmic scale, to see if on this scale a straight line appears to represent
the trend accurately.

This checking of the assumed trend is especially important if
the line of trend is extended beyond the point of actual record 
and employed as a forecast of the probable future of the variable. 
A line which has served very well as a fit of the data already on 
record may not serve at all to represent the long-time movement 
of the same variable in the future. Some new factor may enter 
at any time to turn the trend of the series. When, therefore, a 
trend is extrapolated, the implications of the line beyond the 
period of record should be carefully checked against every avail- 
able line of evidence bearing upon the future of the factor in 
question. It is altogether possible when such a study is made 
that a line of future trend may be selected quite different from 
the extension of the line which fits satisfactorily the data of the 
record of the past. 

In selecting the period to which the line of trend shall be 
mathematically fitted, two considerations are to be held in mind. 
In the first place, the conditions prevailing at the two extremities 
of the period of fit should be as nearly alike as possible. If at 
one end the variable is relatively high, it should be relatively 
high at the other end. If the series is strongly cyclical in char- 
acter, the same phase of the cycle should be represented at both 
ends. If this condition is not provided by the period of fit, the 
line of trend will be disturbed somewhat through the dissimilar 
conditions represented at the extremities. 

In the second place, care should be taken to make the period of 
fit relatively homogeneous throughout; that is, there should not 
have occurred in the period any disturbances so fundamental as to 
have involved any considerable change in the trend of the vari- 
able. Periods of abrupt changes are to be omitted. Whether 
or not the period is homogeneous must be determined from the 
study of related information. Such information may not be 
altogether quantitative in character. In so far as it bears upon 
the underlying tendencies of the factors under analysis, it will 
serve to indicate whether the period of fit tentatively chosen is
sufficiently homogeneous to warrant the development of a single
line of trend.


1 It is to be clearly understood that the technique of forecasting is quite distinct from
mere curve fitting, and in general is much more difficult.

Subject to these considerations, the period of fit should be as 
long as possible. Five to ten years is ordinarily necessary to 
disclose a trend clearly. Lack of adequate data is commonly the 
limiting factor. Data may be either entirely lacking, or incom- 
parable over longer periods. When satisfactory data are not 
available for more than a brief period, no determination of the 
trend can safely be made to rest directly on the data. 

For longer periods economic variables not infrequently display 
trends which cannot be represented at all by a straight line or 
rudimentary curve. Some line of more complex character is 
indicated. Charts 69-71 afford illustrations. In general there 
is no limit to the extent to which refinements of technique may 
be undertaken in the direction of fitting complex forms to em- 
pirical data. 

Selection of the form of line to be used in representing the 
trend is ordinarily followed by calculation of the equation of the 
line. In the case of the straight line, this is a relatively simple 
matter. The steps to be taken in this case may be briefly ex- 
plained. 

As previously indicated, the equation of a straight line may be
given in the generalized form of y = mx + b, where y is the
dependent variable, x the independent variable, m the slope of
the line, and b its intercept (the latter being the distance
from the X-axis at which the line crosses the Y-axis). Any
straight line is given, then, if the values of m and b are known.
In obtaining a straight line of best fit, the problem is to obtain
those values of m and b which determine a straight line most
closely corresponding to the general trend of the series as expressed
in the data empirically given.

CHART 71. SECULAR TRENDS IN THE INDEX NUMBERS OF THE
YIELD AND PRICE OF CORN AND WHEAT, 1882-1918

[Figure: four panels: index of the yield of corn; index of the price of corn; index of the yield of wheat; index of the price of wheat.]

Equations of lines of trend:

Corn:   Yield, $y = 103.9 + .838x - .0032x^2 - .002233x^3$, origin at 1899
        Price, $y = 102.4 + 1.690x + .2275x^2 - .001288x^3$, origin at 1897

Wheat:  Yield, $y = 102.1 + .952x - .0004x^2 - .001101x^3$, origin at 1899
        Price, $y = 99.2 + 1.057x + .1780x^2 - .005570x^3$, origin at 1897

Source: H. L. Moore, Generating Economic Cycles, p. 20.


The simplest method of obtaining the values of m and b is the
so-called method of moments. In this method, two equations
are solved simultaneously for m and b. These two equations
are $\Sigma y = m\Sigma x + nb$ and $\Sigma xy = m\Sigma x^2 + b\Sigma x$.
The solution of these equations may be simplified by assuming that
the origin of x (the origin of time, if the series be a time series)
lies at the middle of the period of fit. In this case the sum of all
the x’s, positive and negative (in other words, $\Sigma x$), becomes 0,
and the two equations above are resolved into two simpler equations
as follows:

$$\Sigma y = nb \quad \text{and} \quad \Sigma xy = m\Sigma x^2$$

It follows therefore that b, the intercept of the straight line of
best fit, equals $\Sigma y / n$, or the arithmetic mean of all the y’s;
and m, the slope of the line, equals $\Sigma xy / \Sigma x^2$, where the
y’s are values of the dependent variable, and the x’s are units of
time measured from the middle of the period of fit. The
determinations of b and m from these two equations are
comparatively easy.

The nature of the computation involved in the determination of 
the equation of the straight line of trend will be evident if a con- 
crete illustration is considered. Suppose a straight line of trend is 
to be fitted to the data of annual pig-iron production for the years 
1904 to 1914. The data are given in column (a) of Table 84. 

TABLE 84. CALCULATION OF SLOPE AND INTERCEPT OF STRAIGHT LINE OF
BEST FIT BY METHOD OF MOMENTS: ODD NUMBER
OF ITEMS IN PERIOD OF FIT

(Annual Pig-iron Production, 1904-1914)
(Unit: million long tons)

          ORIGINAL     DEVIATION IN      DEVIATION TIMES     DEVIATION
YEAR      ITEM         UNITS OF TIME     ORIGINAL ITEM       SQUARED
          (y)          (x)               (xy)                (x²)
          (a)          (b)               (c)                 (d)
1904      16.50        - 5               -  82.50             25
1905      22.99        - 4               -  91.96             16
1906      25.31        - 3               -  75.93              9
1907      25.78        - 2               -  51.56              4
1908      15.94        - 1               -  15.94              1
1909      25.80          0                   0                 0
1910      27.30        + 1                  27.30              1
1911      23.65        + 2                  47.30              4
1912      29.73        + 3                  89.19              9
1913      30.97        + 4                 123.88             16
1914      23.33        + 5                 116.65             25

Sum of negative products = - 317.89; sum of positive products = 404.32
267.30 = Σy            86.43 = Σxy            110 = Σx²

The middle of the period of fit here lies at the year 1909. This
year is consequently regarded as the origin of time; in other
words, the point at which x equals 0. Other values of x are given
negatively for the earlier years and positively for the later years,
as shown in column (b) of the table. The sum of all the y’s
($\Sigma y$) is 267.30. Therefore, the intercept b, which equals
$\Sigma y / n$, is equal to $267.30 / 11 = 24.30$. The values of the
several xy’s appear in column (c). The algebraic sum of these is
86.43. The total of all the x’s squared is 110. Therefore, m, which
equals $\Sigma xy / \Sigma x^2$, is equal to $86.43 / 110 = .7857$.
The equation of the straight line of trend for pig-iron production
from 1904 to 1914 is, thus, y = .7857 x + 24.30, the origin of x
being 1909.
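The computation may be checked with a minimal sketch of the method of moments for an odd number of equally spaced items. The function name is hypothetical. The items are those of Table 84; the 1906 and 1910 items are taken here as 25.31 and 27.30, the values implied by the products and totals of the table.

```python
def fit_straight_line(y_values):
    """Straight line of best fit by the method of moments, with the
    origin of x at the middle of an odd number of equally spaced items.
    Returns (m, b) for y = m*x + b."""
    n = len(y_values)
    if n % 2 == 0:
        raise ValueError("this sketch handles an odd number of items only")
    half = n // 2
    xs = range(-half, half + 1)              # ..., -2, -1, 0, +1, +2, ...
    sum_y = sum(y_values)
    sum_xy = sum(x * y for x, y in zip(xs, y_values))
    sum_x2 = sum(x * x for x in xs)
    # With the sum of the x's equal to zero: b = (sum y)/n, m = (sum xy)/(sum x^2).
    return sum_xy / sum_x2, sum_y / n

# Annual pig-iron production, 1904-1914, million long tons (Table 84).
pig_iron = [16.50, 22.99, 25.31, 25.78, 15.94, 25.80,
            27.30, 23.65, 29.73, 30.97, 23.33]
m, b = fit_straight_line(pig_iron)
print(f"y = {m:.4f} x + {b:.2f}   (origin of x at 1909)")
```

To four and two decimal places respectively, the slope and intercept so obtained should agree with the .7857 and 24.30 of the text.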

The meaning of this equation may be briefly indicated. The 
equation states in mathematical form that the long-time trend 
of pig-iron production during the period 1904 to 1914 called for 
a production of 24.30 million long tons in 1909, the year standing 
at the middle of the period, and an increase of .7857 million 
long tons during each year of the period. This being the meaning 
of the equation, the value of the line of trend at any particular 
point of time is readily ascertained. This value of the line of 
trend, commonly referred to as the ordinate of trend, is to be 
found by adding to, or subtracting from, the intercept (the value 
of y at the origin of time) the number of increments (an incre- 
ment being equal to the value of m) corresponding to the 
number of intervals of time (represented in the value of x) from
the origin to the date in question. Thus, for the year 1904,
x = -5. The value of y (in other words, the ordinate of trend)
for 1904 is therefore given by the expression: y = (-5)(.7857)
+ 24.30 = 20.37 million long tons.
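The same rule gives the ordinate of trend for any year of the period. A brief sketch, using the constants of the equation just derived (the function name and default arguments are assumptions made for illustration):

```python
def ordinate_of_trend(year, m=0.7857, b=24.30, origin=1909):
    """Trend value for a given year: the intercept plus one increment m
    for each year elapsed from the origin (negative before the origin)."""
    return b + m * (year - origin)

for year in (1904, 1909, 1914):
    print(year, round(ordinate_of_trend(year), 2))
```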

A slight complication is introduced in the computation of the
straight line of fit by this method if the number of periods of time
represented in the period of fit is even. In this case, there is no
single item standing at the middle of the series. If the origin of
time is to be taken at the middle of the period of fit, it must be
assumed to lie halfway between the two middle-most items.
If these are separated by a year, the origin may therefore be
assumed to be six months removed from each. The computation
may then be put through if the x’s of the items on either side of
the origin are assumed to be +½ and -½, +1½ and -1½, +2½
and -2½, etc. It is simpler, however, to assume in this case
that the interval of time is six months instead of one year. The
values of x then become +1 and -1 (one period of six months),
+3 and -3 (three periods of six months), +5 and -5 (five
periods of six months), etc., as one moves away (forward and
backward) from the origin of time.

The computation of the line under the conditions just assumed
is illustrated in the data of Table 85. Clearly, the computation
here follows precisely the same form as in the case of an odd
number of items in the period of fit. It must be carefully noted,
however, that in the equation obtained from the computation, x,
the unit of time, is six months and not one year. This fact must
be taken into account in computing the ordinate of trend for
any particular point in time.

TABLE 85. CALCULATION OF SLOPE AND INTERCEPT OF STRAIGHT LINE OF
BEST FIT BY METHOD OF MOMENTS: EVEN NUMBER
OF ITEMS IN PERIOD OF FIT

(Annual Averages of Bradstreet’s Monthly Index of Commodity Prices, 1903-1914)
(Unit: one dollar)

          ORIGINAL     DEVIATION IN      DEVIATION TIMES     DEVIATION
YEAR      ITEM         UNITS OF TIME     ORIGINAL ITEM       SQUARED
          (y)          (x)               (xy)                (x²)
          (a)          (b)               (c)                 (d)
1903      7.94         - 11              - 87.34             121
1904      7.92         -  9              - 71.28              81
1905      8.10         -  7              - 56.70              49
1906      8.42         -  5              - 42.10              25
1907      8.90         -  3              - 26.70               9
1908      8.01         -  1              -  8.01               1
1909      8.52         +  1                 8.52               1
1910      8.99         +  3                26.97               9
1911      8.71         +  5                43.55              25
1912      9.19         +  7                64.33              49
1913      9.21         +  9                82.89              81
1914      8.90         + 11                97.90             121

Sum of negative products = - 292.13; sum of positive products = 324.16
102.81 = Σy            32.03 = Σxy            572 = Σx²

$b = 102.81 / 12 = 8.57$          $m = 32.03 / 572 = .056$

Equation: y = .056 x + 8.57
where the unit of x = 6 months and origin of time is January 1, 1909
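The even-numbered case may be sketched in the same manner, with x counted in half-intervals (six months) so that the items fall at -11, -9, ..., +9, +11. The function name is hypothetical; the data are the items of Table 85.

```python
def fit_even_series(y_values):
    """Method-of-moments straight line for an even number of equally spaced
    items. x is measured in half-intervals: ..., -3, -1, +1, +3, ..."""
    n = len(y_values)
    if n % 2 != 0:
        raise ValueError("this sketch handles an even number of items only")
    xs = range(-(n - 1), n, 2)
    m = sum(x * y for x, y in zip(xs, y_values)) / sum(x * x for x in xs)
    b = sum(y_values) / n
    return m, b

# Annual averages of Bradstreet's index, 1903-1914 (Table 85).
bradstreet = [7.94, 7.92, 8.10, 8.42, 8.90, 8.01,
              8.52, 8.99, 8.71, 9.19, 9.21, 8.90]
m, b = fit_even_series(bradstreet)

# The unit of x is six months: the 1914 item lies 11 half-years after
# the origin (January 1, 1909), so its ordinate of trend uses x = 11.
trend_1914 = b + m * 11
print(round(m, 3), round(b, 2), round(trend_1914, 2))
```

Had x been entered as 5 or 6 for 1914, as though the unit were a year, the ordinate would be understated; this is the precaution the text calls for.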

Not infrequently it is desirable to translate an equation which 
has been computed on the basis of data representing one set of 
time-intervals into an equation representing the corresponding 
series with a different set of time-intervals. Thus, if the equation 
has been computed on the basis of annual pig-iron production, 
it may be desirable to have the corresponding equation for 
monthly pig-iron production. In this case it is not necessary to 
compute anew the line of fit for the monthly items. A simple 
translation of the equation effects the desired result. This 
follows from the fact that the constants of the two equations (m 
and b) bear definite relationships to each other. The intercept 
b of the equation for monthly items will be just 1/12 of that for
annual items. This follows from the fact that the monthly
totals are on the average just 1/12 of the annual totals. If this is
true of the intercept, it is likewise true of the slope of the equation.
In a sense, the unit of measurement for both series and
equation has been reduced to 1/12 of its former value. In addition,
the increment is reduced again to 1/12 because it applies in the
monthly equation to a period only 1/12 as long as the period of the
increment in the annual equation. To reduce an equation computed
for yearly aggregates at annual intervals to one for monthly
aggregates at monthly intervals, the intercept of the first equation
consequently should be divided by twelve and the slope
by twelve times twelve, or 144.¹ Thus, the equation for annual
pig-iron production may be converted to the corresponding 
equation for monthly pig-iron production as follows: 


1 The relationships will be different, of course, if the intervals of time are not in the ratio 
of 12:1, or if the magnitude of the annual and monthly items are not in this ratio, as they 
are not, for example, in a series of price indices by years in one case and by months in another. 
In translating equations from one base to another, the student must bear carefully in mind 
the precise relationship between the magnitudes of the items for the different time-intervals 
and the relative length of the time-intervals themselves. 
Annual data      y = .7857 x + 24.30     (x = 1 year)

Monthly data     y = .0055 x + 2.025     (x = 1 month)

Similarly, the equation for annual averages of Bradstreet’s 
price index may be converted to an equation for the monthly 
index as follows:


Annual data      y = .056 x + 8.57       (x = 6 months)
Monthly data     y = .0093 x + 8.57      (x = 1 month)


Here the yearly items, as averages of the monthly, are on the 
whole neither larger nor smaller than the monthly items. The 
only change that need be made in the equation is a division of 
the slope by 6 to adjust for the shortening of the interval of time. 
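
The two translations just given follow one rule, which may be put in a short Python sketch. The function and its parameter names are assumptions made for illustration: `interval_ratio` is the number of new time-units in one old unit, and `magnitude_ratio` is the number of times the old items exceed the new (12 for annual totals against monthly totals, 1 for averages or index numbers).

```python
def translate_equation(m, b, interval_ratio, magnitude_ratio=1.0):
    """Re-express the line y = m*x + b for a shorter unit of time.

    The intercept is scaled only by the change in the size of the items;
    the slope is scaled by that change and by the shortening of the interval.
    """
    new_b = b / magnitude_ratio
    new_m = m / (magnitude_ratio * interval_ratio)
    return new_m, new_b


# Annual pig-iron totals to monthly totals: items and interval both 12 to 1.
pig_m, pig_b = translate_equation(0.7857, 24.30,
                                  interval_ratio=12, magnitude_ratio=12)

# Bradstreet's index: x in six-month units to x in months; the items are
# averages, so their magnitude is unchanged.
idx_m, idx_b = translate_equation(0.056, 8.57, interval_ratio=6)

print(round(pig_m, 4), round(pig_b, 3))
print(round(idx_m, 4), idx_b)
```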

The determination of the equation of a compound-interest curve 
is most easily accomplished by fitting a straight line to the 
logarithms of the original data. The same pair of equations is 
used as in the illustrations given above. The nature of the 
computation is shown in Table 86. 
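
As a minimal sketch of the procedure, the following Python fits a straight line to common logarithms and converts the constants back to natural numbers. The series used is hypothetical (it grows at exactly 10 per cent per period) and is not taken from Table 86; the function name is likewise an assumption.

```python
import math


def fit_compound_interest(y):
    """Fit y = a * r**x by fitting a straight line to log10(y).

    An odd number of items is assumed, with x = 0 at the middle item.
    """
    n = len(y)
    if n % 2 != 1:
        raise ValueError("this sketch handles an odd number of items only")
    half = n // 2
    x = list(range(-half, half + 1))
    logs = [math.log10(v) for v in y]
    b = sum(logs) / n                                    # intercept, in logarithms
    m = sum(xi * li for xi, li in zip(x, logs)) / sum(xi * xi for xi in x)
    return 10 ** b, 10 ** m                              # a and r in natural numbers


# Hypothetical series: 5.0 at the origin, rising 10 per cent each period.
series = [5.0 * 1.1 ** k for k in range(-3, 4)]
a, r = fit_compound_interest(series)
print(round(a, 3), round(r, 3))
```

The antilogarithm of the slope, less one, is the rate of growth per period.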

Methods of calculating the equations of curves of more complex 
character do not lie within the scope of the present study.¹
Not infrequently, advanced mathematical technique is required. 
Moreover, the logic underlying the fitting of a curve to empirical 
data is the same whether the formula be simple or complex. 
The major problems in the analysis of evolutionary movements 
in time series relate not so much to the precise form of curve to 
be employed as to the significance of the line of trend and the 
uses to which it may be reasonably put. 

It may be desirable to eliminate the trend from the record of a 
variable. The trend having been accurately determined and 
mathematically formulated in a definite equation, elimination is 
effected by relatively simple means. The procedure involves, 
first, the calculation of the ordinate of trend corresponding to 
each interval of time for which items are given in the series from 
which the trend is to be eliminated. For straight-line trends, 


1 For the fitting of some of these more complex forms a convenient reference is Mills,
Statistical Methods, chap. VII. R. B. Prescott, in an article on “The Law of Growth in
Forecasting Demand” (Journal of the American Statistical Association, Dec., 1922, pp.
471-479), explains the application of the Gompertz curve to empirical data.
these figures are readily obtained through the use of an adding
machine: if the ordinate of trend is obtained for the first date in 
the series, the ordinate for each successive date is secured by 
adding the increment (m) corresponding to the interval of time 
expressed in the series. The results thus secured may be en- 
tered in a column paralleling the original items from which the 
trend has been derived. 


TABLE 86. CALCULATION OF CONSTANTS OF COMPOUND-INTEREST CURVE OF
BEST FIT BY METHOD OF MOMENTS APPLIED TO LOGARITHMS

(Yearly Production of Cigarettes, 1901-1913. Unit: billion cigarettes)

YEAR   | ORIGINAL ITEM | LOGARITHM OF      | DEVIATION | PRODUCT OF DEVIATION    | DEVIATION SQUARED
       |               | ORIGINAL ITEM (y) |    (x)    | AND LOGARITHM (xy)      |       (x²)
       |      (a)      |       (b)         |    (c)    |          (d)            |       (e)
1901   |     2.72      |      .4346        |   - 6     | - 2.6076                |        36
1902   |     2.96      |      .4654        |   - 5     | - 2.3270                |        25
1903   |     3.36      |      .5263        |   - 4     | - 2.1052                |        16
1904   |     3.43      |      .5353        |   - 3     | [illegible]             |         9
1905   |     3.67      |      .5647        |   - 2     | - 1.1294                |         4
1906   |     4.50      |      .6532        |   - 1     | -  .6532    - 10.3983   |         1
1907   |     5.26      |      .7210        |     0     |   0                     |         0
1908   |     5.74      |      .7589        |   + 1     |    .7589                |         1
1909   |     6.82      |      .8338        |   + 2     |   1.6676                |         4
1910   |     8.64      |      .9365        |   + 3     |   2.8095                |         9
1911   |    10.47      |     1.0199        |   + 4     |   4.0796                |        16
1912   |    13.17      |     1.1196        |   + 5     |   5.5980                |        25
1913   |    15.56      |     1.1920        |   + 6     |   7.1520    + 22.0656   |        36
Total  |               |     9.7612        |           |             + 11.6673   |       182

b = Σy / N = 9.7612 / 13 = .7509          m = Σxy / Σx² = 11.6673 / 182 = .0641

In terms of logarithms     y = .0641 x + .7509
In natural numbers         y = (5.636) (1.158)^x


The second step in the elimination of trend involves the divi- 
sion of each original item by the corresponding ordinate of 
trend.¹ By multiplying the percentage ratios thus derived by


1 This method assumes that deviations from the trend are to be regarded relatively, not 
absolutely. This is, presumably, the attitude ordinarily to be taken. If it appears, 
however, in a particular case that deviations from the trend are to be dealt with abso- 
lutely, the ordinate of trend should be subtracted from the original item, not divided into it. 
100, a series is obtained which is expressed in convenient whole
numbers and is free of the trend of the original items. If desired,
the result may be carried one step further by stating the items in
amounts plus and minus above and below 100. The results thus
obtained for the case of monthly pig-iron production from 1906 
to 1909 are given in Table 87. Similar figures for the period 


CHART 72. FLUCTUATIONS OF PIG-IRON PRODUCTION ABOVE AND
BELOW THE LINE OF TREND, 1919-1923

[Chart: per cent of trend, by months, 1919-1923.]

Source: Review of Economic Statistics, prel. vol. V, no. 1, p. 34; VI, Supp. 1, p. 128.


1919-1923 are shown in Chart 72. In the curve here shown, the
influence of the trend of the variable does not appear at all. 
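
The two steps of elimination (ordinates by successive addition, then division of item by ordinate) may be sketched as follows. The four items and the first ordinate are those of Table 87 for January to April, 1906; the monthly increment of 5.46 is an assumption, inferred only from the spacing of the ordinates in that table, and the function names are illustrative.

```python
def trend_ordinates(first_ordinate, increment, count):
    """Ordinates of a straight-line trend, built up by successive addition."""
    return [first_ordinate + k * increment for k in range(count)]


def percent_of_trend(items, ordinates):
    """Each original item expressed as a percentage of its ordinate of trend."""
    return [100.0 * y / t for y, t in zip(items, ordinates)]


items = [2068, 1904, 2155, 2073]                      # Jan.-Apr. 1906, column (a)
ordinates = trend_ordinates(1799, 5.46, len(items))   # 5.46 is an assumed increment
relatives = percent_of_trend(items, ordinates)
deviations = [p - 100.0 for p in relatives]           # amounts above or below 100

print([round(t) for t in ordinates])
print([round(p) for p in relatives])
```

Where deviations are to be regarded absolutely rather than relatively, the division would be replaced by a subtraction of ordinate from item.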

The satisfactory determination of lines of trend is almost 
always a difficult phase of the statistical analysis of time series;
at times it is quite impossible. The series may be so badly broken 
by some cataclysmic force as to eliminate all traces of trend for 
a while; or for considerable periods in which underlying tenden- 
cies are undergoing alterations, the trend may be indeterminate. 
Under these conditions, it may still be possible to eliminate the 
TABLE 87. ELIMINATION OF TREND THROUGH EXPRESSION OF ORIGINAL ITEMS AS PERCENTAGES
OF ORDINATE OF TREND
(Monthly Pig-Iron Production, 1906-1909*)
(Unit: thousand long tons)
(Trend based on period 1904-1914)

MONTH            | ORIGINAL ITEM | ORDINATE OF TREND | RATIO OF ORIGINAL ITEM
                 |               |                   | TO ORDINATE OF TREND
                 |      (a)      |        (b)        |        (c)
1906 January     |     2068      |       1799        |        115
     February    |     1904      |       1804        |        105
     March       |     2155      |       1810        |        119
     April       |     2073      |       1815        |        114
     May         |     2098      |       1821        |        115
     June        |     1976      |       1826        |        108
     July        |     2013      |       1832        |        110
     August      |     1926      |       1837        |        105
     September   |     1960      |       1842        |        106
     October     |     2196      |       1848        |        119
     November    |     2187      |       1853        |        118
     December    |     2235      |       1859        |        120
1907 January     |     2205      |       1864        |        118
     February    |     2045      |       1870        |        109
     March       |     2220      |       1875        |        119
     April       |     2216      |       1881        |        118
     May         |     2295      |       1886        |        122
     June        |     2234      |       1892        |        118
     July        |     2255      |       1897        |        119
     August      |     2250      |       1902        |        118
     September   |     2183      |       1908        |        115
     October     |     2330      |       1913        |        122
     November    |     1828      |       1919        |         95
     December    |     1234      |       1924        |         64
1908 January     |     1045      |       1930        |         54
     February    |     1077      |       1935        |         56
     March       |     1228      |       1941        |         63
     April       |     1149      |       1946        |         59
     May         |     1165      |       1951        |         60
     June        |     1092      |       1957        |         56
     July        |     1218      |       1962        |         62
     August      |     1348      |       1968        |         69
     September   |     1418      |       1973        |         72
     October     |     1563      |       1979        |         79
     November    |     1577      |       1984        |         79
     December    |     1740      |       1990        |         87
1909 January     |     1801      |       1995        |         90
     February    |     1703      |       2001        |         85
     March       |     1832      |       2006        |         91
     April       |     1738      |       2011        |         86
     May         |     1880      |       2017        |         93
     June        |     1929      |       2023        |         95
     July        |     2101      |       2028        |        104
     August      |     2240      |       2033        |        110
     September   |     2385      |       2039        |        117
     October     |     2600      |       2044        |        127
     November    |     2547      |       2050        |        124
     December    |     2635      |       2055        |        128


* Monthly figures comprise about 85% of the yearly totals.
influences of trend so far as such influences are operative. The
result is accomplished through the use of a second series which 
may be reasonably assumed to have been subject to the same long- 
time influences as the series from which the trend element needs 
to be taken. By using this second series as a reference line, the 
trend may be taken out without being measured. A concrete 
illustration will serve to make the procedure clear. 


CHART 73. CYCLICAL FLUCTUATIONS OF COMMODITY PRICES

(Harvard Ten-Commodity Index Minus 5.38 Times Bradstreet’s Index, 1919-1922)

[Chart: per cent, by months, 1919-1922.]

Source: For data see “Statistical Record,” Review of Economic Statistics, prel. vols.
III, IV, and V.


Suppose the problem is to eliminate the influence of trend in 
two series of price indices. Owing to widespread monetary 
inflation and deflation, the trends of these price indices have 
been practically indeterminate since 1914. It may reasonably 
be assumed, however, that the two indices tend to exhibit the 
same trend, and are being influenced in approximately the same 
degree by such underlying tendencies as are currently operative. 
TABLE 88. ELIMINATION OF TREND THROUGH COMPARISON OF
SERIES OF THE SAME SPECIES

(Bradstreet’s and Harvard Committee Indices, 1919-1922)

(a) Bradstreet’s Index of Commodity Prices
(b) Harvard Committee Index
(c) Bradstreet’s Index Multiplied by 5.38
(d) Harvard Committee Index Minus Bradstreet’s Index Adjusted *

MONTH            |   (a)   |   (b)   |   (c)   |   (d)
1919 January     |  18.53  |   98.9  |   99.7  | -   .8
     February    |  17.63  |   93.1  |   94.8  | -  1.7
     March       |  17.22  |   87.9  |   92.6  | -  4.7
     April       |  17.28  |   85.2  |   93.0  | -  7.8
     May         |  17.24  |   87.0  |   92.8  | -  5.8
     June        |  18.09  |   96.9  |   97.3  | -   .4
     July        |  18.90  |  101.6  |  101.7  | -   .1
     August      |  20.00  |  108.2  |  107.6  | +   .6
     September   |  19.47  |  105.9  |  104.7  | +  1.2
     October     |  19.52  |  102.1  |  105.0  | -  2.9
     November    |  19.90  |  110.0  |  107.1  | +  2.9
     December    |  20.18  |  114.9  |  108.6  | +  6.3
1920 January     |  20.37  |  118.2  |  109.6  | +  8.6
     February    |  20.87  |  120.4  |  112.3  | +  8.1
     March       |  20.80  |  121.6  |  111.9  | +  9.7
     April       |  20.71  |  121.0  |  111.4  | +  9.6
     May         |  20.73  |  128.5  |  111.5  | + 17.0
     June        |  19.88  |  128.6  |  107.0  | + 21.6
     July        |  19.35  |  124.1  |  104.1  | + 20.0
     August      |  18.83  |  115.7  |  101.3  | + 14.4
     September   |  17.97  |  108.7  |   96.7  | + 12.0
     October     |  16.91  |  102.5  |   91.0  | + 11.5
     November    |  15.68  |   87.2  |   84.4  | +  2.8
     December    |  13.63  |   74.0  |   73.3  | +   .7
1921 January     |  12.66  |   65.8  |   68.1  | -  2.3
     February    |  12.37  |   62.9  |   66.6  | -  3.7
     March       |  11.86  |   57.9  |   63.8  | -  5.9
     April       |  11.37  |   51.9  |   61.2  | -  9.3
     May         |  10.82  |   51.7  |   58.2  | -  6.5
     June        |  10.62  |   52.3  |   57.1  | -  4.8
     July        |  10.73  |   51.1  |   57.7  | -  6.6
     August      |  11.06  |   49.8  |   59.5  | -  9.7
     September   |  11.09  |   52.1  |   59.7  | -  7.6
     October     |  11.19  |   55.9  |   60.2  | -  4.3
     November    |  11.35  |   55.7  |   61.1  | -  5.4
     December    | [illegible] | 53.1 |  60.8  | -  7.7
1922 January     |  11.37  |   53.6  |   61.2  | -  7.6
     February    |  11.42  |   53.2  |   61.4  | -  8.2
     March       |  11.60  |   55.5  |   62.4  | -  6.9
     April       |  11.53  |   54.3  |   62.0  | -  7.7
     May         |  11.70  |   56.6  |   63.0  | -  6.4
     June        |  11.90  |   60.1  |   64.0  | -  3.9
     July        |  12.05  |   63.4  |   65.2  | -  1.8
     August      |  12.07  |   63.4  |   64.9  | -  1.5
     September   |  12.08  |   64.6  |   65.0  | -   .4
     October     |  12.50  |   68.2  |   67.3  | +   .9
     November    |  13.55  |   71.3  |   72.9  | -  1.6
     December    |  13.78  |   75.6  |   74.1  | +  1.5


* Note: These figures clearly contain no trend element.
If one of the two series is of such a nature as to fluctuate more
widely than does the other, a series from which the trend has
been eliminated may be secured by taking the deviations of the
more widely variable series from the corresponding items of the
less widely variable series. This derived series of items is free
from the element of trend since the two original series are equally
influenced by trend, and the differences between the two
consequently can show no trend influence. The method of
eliminating the influence of trend in this fashion is illustrated in
Table 88 and Chart 73.¹
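
The arithmetic of Table 88 may be sketched briefly in Python. The figures are those of the first three months of 1919; the multiplier 5.38 is the one stated in the table, and the function name is an assumption. As in the table, the adjusted Bradstreet figure is rounded to one decimal before the difference is taken.

```python
FACTOR = 5.38  # multiplier applied to Bradstreet's index in Table 88


def trend_free_differences(wide, narrow, factor):
    """Deviation of the more variable series from the adjusted reference series."""
    adjusted = [round(factor * v, 1) for v in narrow]          # column (c)
    return [round(w - a, 1) for w, a in zip(wide, adjusted)]   # column (d)


harvard = [98.9, 93.1, 87.9]        # column (b), Jan.-Mar. 1919
bradstreet = [18.53, 17.63, 17.22]  # column (a), Jan.-Mar. 1919

differences = trend_free_differences(harvard, bradstreet, FACTOR)
print(differences)
```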

The trend is only one of the several types of movement dis- 
cernible in time series. Seasonal variation, cyclical fluctuations, 
episodic and accidental movements may all combine to influence 
the items. Having considered the measurement and elimination 
of trend, it is desirable to pursue the same line of analysis for the 
other important typical elements. 


1 See W. M. Persons, “The Revised Index of General Business Conditions,” Review of
Economic Statistics, prel. vol. V, p. 192.


CHAPTER XVIII 


PERIODIC MOVEMENTS: SEASONAL VARIATION 


Among the different varieties of movement exhibited by time
series, one of the most important is the recurrent movement
of regular and definite period. Bowley describes the periodic
series as “consisting of figures which within a given period . . .
reach maxima and minima at assigned times, and show the
fluctuations recurring with regularity in successive periods.”¹
Movements of this general type occur frequently. In some lines
of analysis, periodic movements are of primary importance.²

A number of forms of periodic movement may be recognized. 
Perhaps the form of most general significance is seasonal varia- 
tion. Seasonal variation is a repetitive movement having its 
origin in the round of the seasons. The general character of 
this type of variation will be obvious from a few concrete 
illustrations. 

Some of the simplest illustrations of seasonal variation are to be 
found in the field of climatology. Take, for instance, the records 
of the flows of rivers. In Table 89 the mean monthly flows of 
the Colorado, Niagara, and Tennessee rivers are given. The 
periods of highest and lowest water differ among these three 
rivers; the three are, of course, very differently located and 
subject to very different conditions. Nevertheless, the marked 
seasonality of the flow of all three rivers is unmistakable. 

1 Elements of Statistics, p. 178. 

2 Movements of this sort are not to be confused with cyclical fluctuations. The latter,
though undulatory, are not periodic; the wave-like movements are of varying length,
whereas in periodic movements the wave-lengths are uniform. The distinction is important,
for only when there is clear periodicity may certain lines of analysis be followed. The
present chapter deals only with those movements which exhibit a regular and definite period.
Illustrations of seasonal variation may be found in every field
of statistical inquiry. One of the most striking examples is in 
the incidence of different diseases. In Table 90 death rates for 
the registration area of the United States are given by months 
for general mortality and two specific diseases. The seasonal 
movements of these rates are widely divergent. But each ex- 
hibits a decided seasonal variation. 


TABLE 89. MEAN MONTHLY FLOWS OF COLORADO, NIAGARA,
AND TENNESSEE RIVERS*

(Unit: cubic feet per second)

                 |       MEAN MONTHLY FLOW         | PERCENTAGE OF MONTHLY AVERAGE
MONTH            | COLORADO | NIAGARA  | TENNESSEE | COLORADO | NIAGARA | TENNESSEE
                 |  RIVER   |  RIVER   |   RIVER   |  RIVER   |  RIVER  |   RIVER
January          |  10,000  | 177,000  |   85,000  |    42    |    93   |   157
February         |  16,000  | 174,500  |  100,000  |    67    |    91   |   184
March            |  24,000  | 181,500  |  110,000  |   101    |    95   |   201
April            |  25,000  | 193,500  |   92,000  |   105    |   101   |   170
May              |  34,000  | 200,500  |   50,000  |   143    |   105   |    92
June             |  64,000  | 207,000  |   37,000  |   269    |   108   |    68
July             |  48,000  | 205,500  |   35,000  |   202    |   108   |    64
August           |  23,000  | 200,000  |   27,000  |    97    |   105   |    50
September        |  13,000  | 195,500  |   22,000  |    55    |   102   |    40
October          |  11,000  | 190,000  |   20,000  |    46    |    99   |    37
November         |   9,000  | 185,000  |   27,000  |    38    |    97   |    50
December         |   8,000  | 179,500  |   47,000  |    34    |    94   |    87
Monthly Average  |  23,750  | 191,000  |   54,000  |   100    |   100   |   100


* Colorado River figured for years 1901, 1905, 1907, 1908;
Niagara River figured for 1892-1902 and 1907-1916;
Tennessee River figured for forty-eight years.


Source: Boston Sunday Herald, August 26, 1923, sec. C, p. 8. 


In economic and business analysis seasonal fluctuations are 
encountered on every hand. Concrete illustrations are easily 
cited. In Table 91 characteristic seasonal movements are shown
for five different variables. 

The series here are drawn from widely separated fields, and 
evidence widely different responses to seasonal influences. They 
all serve, however, to indicate how clearly seasonal variations 
TABLE 90. RATIO OF MONTHLY DEATHS TO AVERAGE NUMBER FOR EACH MONTH

MONTH        | GENERAL MORTALITY | PNEUMONIA    | TUBERCULOSIS
January      |       106         |    140       |      95
February     |       106         |    184       |     101
March        |       110         |    219       |      99
April        |       107         |    134       |     105
May          |        94         |     95       |     106
June         |        89         | [illegible]  |     139
July         |       107         |     38       |      97
August       |       103         |     36       |      91
September    |        98         |     44       |      92
October      |        91         |     53       |      92
November     |        92         |     82       |      94
December     |   [illegible]     |    102       |      95

may be exhibited by time series in the fields of economics and 
business. 

As already indicated, seasonal variation is only one form of a 
very general variety of movement, namely the recurrent move- 


TABLE 91. SELECTED MONTHLY INDICES OF SEASONAL VARIATION

(Average month = 100)

(a) Monthly Values of Building Permits Issued for Twenty Leading Cities, 1903-1917
(b) Monthly Bank Clearings, New York City, 1903-1917
(c) [heading illegible]
(d) Monthly Rate of Interest on Call Loans on New York Stock Exchange, 1890-1917
(e) Retail Sales of Department Stores, 1919-1921

MONTH      |  (a)  |     (b)     |  (c)  |  (d)  |  (e)
January    |   70  |     115     |  142  |  110  |   89
February   |   73  |      90     |   99  |   83  |   73
March      |  122  |     101     |   99  |   89  |   97
April      |  135  |     101     |   95  |   91  |  101
May        |  123  | [illegible] |   92  |   87  |  102
June       |  119  |      96     |   89  |   80  |  100
July       |  106  |      97     |   94  |   81  |   72
August     |   98  |      91     |   88  |   80  |   67
September  |   95  |      90     |   86  |  104  |   88
October    |   95  |     109     |   98  |  119  |  122
November   |   86  |     103     |   99  |  124  |  121
December   |   78  |     108     |  118  |  150  |  168


Sources: (a)-(d) Review of Economic Statistics, prel. vol. I, pp. 50, 55, 50, 63.
(e) Monthly Review of Credit and Business Conditions, Feb. 1, 1922, p. 10.
ment of definite period. Movements of this general type differ
in the lengths of their periods. Seasonal variation is marked by
a period of twelve months, or a year. Other recurrent movements
exhibit periods of much shorter duration, such as seven
days, or twenty-four hours.¹ These movements of shorter period
may easily be illustrated. 

An illustration of the seven-day, or weekly, period may be 
found in the variation of the number of toll calls handled on a 
central telephone exchange. In Table 92 relatives for this variable
have been taken from the records of toll calls handled by an
important telephone company during the month of June, 1923. 


TABLE 92. TYPICAL DAILY FLUCTUATIONS IN NUMBER OF
TELEPHONE TOLL CALLS DURING JUNE, 1923

(Average day = 100)

DAY OF WEEK   | RELATIVE
Sunday        |    52
Monday        |   115
Tuesday       |   110
Wednesday     |   107
Thursday      |   109
Friday        |   110
Saturday      |    97


The figures show differences which might naturally be expected. 
The lack of business calls on Sunday and on the afternoon of 
Saturday reduces the total volume of business on these days. 
Monday rises to a maximum because of the accumulation of 
business over the week-end. The movement is obviously re- 
current since the same conditions persist week after week. The 
period is definitely fixed as of seven days since like conditions 
recur on a given day each week. 

An illustration of recurrent movements with an interval of 
twenty-four hours is given in Table 93, showing changes in the 
automobile traffic on Fifth Avenue, New York City, during 


1 These may be referred to as daily and hourly variations, respectively. 
successive hours of the day. The two columns of figures (one
for north-bound, the other for south-bound vehicles) show
marked differences in the hourly movement of the two flows of 
traffic. Movements of this sort are exhibited by a large number 
of business factors. 


TABLE 93. TRAFFIC CHANGES ON FIFTH AVENUE AT FORTY-SECOND STREET,
NEW YORK CITY, BETWEEN THE HOURS OF SEVEN A.M. AND SEVEN P.M.

                |      NUMBER OF VEHICLES
TIME OF DAY     | North bound | South bound
A.M.   7-8      |     140     |     345
       8-9      |     425     |    1105
       9-10     |     785     |    1415
      10-11     |     860     |    1120
      11-12     |     995     |    1025
P.M.  12-1      |     960     |    1510
       1-2      |     910     |     960
       2-3      |     815     |    1055
       3-4      |     710     |     950
       4-5      |     840     |     835
       5-6      |    1175     |     740
       6-7      |    1305     |     710


An interesting combination of two recurrent movements of 
different periods is shown in Table 94, giving the average wind 
velocity at San Francisco by hours of the day and months of the 
year. The row at the top of the table shows for the period 1891-1910
the variation of wind velocity by months of the year (in other
words, the seasonal variation). The column at the left of the 
table shows, similarly, the variation in wind velocity at hours of 
the day (in other words, the hourly, or diurnal, variation). The 
two movements are of the same fundamental type. Both show a 
definite and regular periodicity. 

Though recurrent movements for periods other than the year 
are important, and call for careful analysis in many cases, seasonal 
variation is to be regarded as the most significant of the recurrent 
movements of regular and known period. The methods of 
analysis to be employed in measuring periodic movements may
be best developed, therefore, for the case of seasonal variation.¹

The measurement of seasonal variation has to deal with two 
essentially different conditions: (1) the condition in which 
seasonal movement stands by itself except as disturbed by minor 
accidental fluctuations; (2) the condition in which seasonal 
variation is merged with other varieties of movement, say, with a 
pronounced trend, or with marked cyclical or irregular fluctua- 
tions. The problems involved in measuring the seasonal in 
these two cases are quite different. The measurement of an 
isolated seasonal is the simpler of the two and will be dealt with 
first. 

Measurement of seasonal variation under the simple conditions 
in which it occurs undisturbed by other influences, save those which 
are accidental, may be based upon the aggregate of each month for 
the entire period and the corresponding average monthly aggre- 
gate for the same period. Suppose, for example, that we have to 
measure the seasonal variation shown in the data of Table 95. 
Here the rainfall in inches on the Wachusett Watershed, the 
source of the Greater Boston water supply, is given by months 
from January, 1897, to December, 1920. The rainfall for all the 
January’s is readily obtained in the form of an aggregate for the 
twenty-four years. The same is true for the month of February, 
and for each of the remaining months of the year. A corre- 
sponding average monthly total may be had for the same period of 


1 For a discussion of methods of measuring seasonal variation, with interesting illustrations,
see A. L. Bowley and K. C. Smith, “Seasonal Variations in Finance, Prices, and
Industry,” Special Memo. No. 7 (July, 1924), London & Cambridge Economic Service.

If the primary object of analysis is to measure cyclical fluctuations, an attempt may
be made to eliminate the influence of seasonal variation without determining the characteristics
of the seasonal movement. Elimination without measurement may be effected by
referring the original items to a series of items which, presumably, have the same seasonal
variation as the series being analyzed. Thus, if each monthly item in a time series is
reduced to a relative of the average item of the same month for the preceding ten years, a
“correction” has been made for seasonal variation. (See M. T. Copeland, “Statistical
Indices of Business Conditions,” Quarterly Journal of Economics, vol. XXIX, pp. 522-562.) 
This procedure is not altogether satisfactory for two reasons: in the first place, it implies 
that the seasonal variation is accurately measured by differences in the average base items 
for the several months, an implication which is not altogether warranted as we shall see; 
and in the second place, it does not afford any characterization of one of the most impor- 
tant features of the original series. 
twenty-four years. The seasonal variation is then registered in
simple fashion in a series of relatives, above and below 100,
expressing the relationship of the aggregate of each month to the
average monthly aggregate for the years upon which the
measurement of the seasonal variation is based.¹ The results
thus obtained are shown in the bottom row of the table.²
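
A minimal Python sketch of the simple summation method follows. The two years of monthly figures are hypothetical and serve only to show the arithmetic; the function name is an assumption.

```python
def summation_relatives(years):
    """Seasonal relatives by the simple summation method.

    `years` is a list of twelve-item lists (January to December). The
    aggregate of each month is expressed as a percentage of the average
    monthly aggregate.
    """
    monthly_totals = [sum(year[m] for year in years) for m in range(12)]
    average_total = sum(monthly_totals) / 12
    return [100.0 * total / average_total for total in monthly_totals]


# Hypothetical monthly items for two years.
years = [
    [1, 2, 3, 4, 5, 6, 6, 5, 4, 3, 2, 1],
    [3, 2, 3, 4, 5, 6, 6, 5, 4, 3, 2, 3],
]
relatives = summation_relatives(years)
print([round(r, 1) for r in relatives])
```

The relatives so obtained necessarily average 100.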

The method here is simple and commonly satisfactory, but is 
sometimes open to objection. It is to be noted that seasonal 
relatives obtained by this summation method are decidedly in- 
fluenced by extreme items. Where such extreme items appear 
to upset the relatives substantially, it is necessary to adopt a 
somewhat different method. A simple procedure is to calculate 
the relative of the item of each month to the average of the 
monthly items of the year in which the month lies and then to 
take the median of the relatives of each month, i.e., all the
January’s, all the February’s, and so on. The seasonal relatives
may then be found by expressing these median relatives as 
percentages of their own average. This method clearly avoids 
the objection that the relatives are seriously affected by extreme 
items. 
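
The alternative procedure may be sketched as follows, again with hypothetical figures. The third year contains one extreme January item; the function name is an assumption made for illustration.

```python
from statistics import mean, median


def median_relatives(years):
    """Seasonal relatives from medians of within-year relatives.

    Each item is first expressed as a percentage of the average monthly
    item of its own year; the median is then taken month by month, and the
    medians are adjusted to average 100.
    """
    per_year = [[100.0 * v / mean(year) for v in year] for year in years]
    medians = [median(rel[m] for rel in per_year) for m in range(12)]
    scale = 100.0 / mean(medians)
    return [scale * v for v in medians]


base = [4, 5, 7, 9, 10, 12, 13, 11, 9, 7, 5, 4]   # hypothetical seasonal pattern
extreme_year = base.copy()
extreme_year[0] = 40                              # one extreme January item
years = [base, [2 * v for v in base], extreme_year]

result = median_relatives(years)
print([round(r, 1) for r in result])
```

In this example the extreme item leaves the January relative at 50, the value given by the two undisturbed years.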

Measurement of seasonal variation in many time series does 
not, however, present the simple conditions prevailing in the 
example just cited. Economic and business time series, for 
instance, commonly show seasonal variation merged with secular 
trends and cyclical fluctuations as well as accidental and episodic 
movements. When this is the case, it is not safe to obtain sea- 
sonal relatives from any mere summing or averaging of the items 
of the successive months of the year. More refined analysis is 
desirable. 

One of the most satisfactory methods thus far devised for 
measuring seasonal variation is that known as the link relative 
method.³ This method entails the following steps:


1 Any set of seasonal relatives (index of seasonal variation) should average 100 per cent
since it is of the nature of seasonal variation to involve no net change for the year as a whole. 
2 The method used in this example may be referred to as the simple summation method. 
Identical results would be obtained by working with monthly (arithmetic) averages instead 
of monthly aggregates, but the aggregates have the advantage of being more easily obtained. 
3 Devised by Warren M. Persons of the Harvard Committee on Economic Research. 
1. Calculation of the percentage of each monthly item to the
item of the preceding month. These percentage figures are
known as the “link relatives.”

2. The formation of a multiple frequency table of these link
relatives, a separate frequency distribution being set up
for each month; that is, for all the January link relatives, all
the February link relatives, and so on.

3. The determination, from the multiple frequency table, of
the median link relative for each month.

4. The chaining together of these median link relatives by
taking January equal to 100 (or to any other convenient
figure), then applying the February median link to this
January item, then applying the March median link to this
result, and so on until the chain has been carried around
back to January.

5. The elimination of any discrepancy disclosed by the completion
of the chain of the median link relatives.

6. The reduction of the corrected chain relatives to an average
of 100.


These steps call for brief explanation. The method can be best
followed by taking a concrete illustration.¹


The first step of the method consists, as indicated, in reducing
each monthly item in the series to a percentage of the item of the
preceding month. The results of this reduction in the illustrative
case are shown in Table 96. By referring each item to the preceding
item as a base, little opportunity is given for the play of
secular or cyclical influences. The link relatives thus afford a 
satisfactory foundation for the measurement of seasonal variation. 
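
The first step may be sketched in Python as follows. The four items are the pig-iron figures for January to April, 1906, as they stand in Table 87; the function name is an assumption.

```python
def link_relatives(series):
    """Percentage of each item to the item immediately preceding it."""
    return [100.0 * current / previous
            for previous, current in zip(series, series[1:])]


production = [2068, 1904, 2155, 2073]   # Jan.-Apr. 1906, thousand long tons
links = link_relatives(production)      # links for February, March, April
print([round(v) for v in links])
```

A series of n items yields n - 1 link relatives, since the first item has no predecessor.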

The second step consists in obtaining the median link relative. 
This entails the formation of frequency distributions for the 
link relatives of each month. These frequency distributions for 
the illustrative data are given in Table 97. 

The determination of the median links from these frequency
distributions is a relatively simple matter, the steps to be
taken having already been considered (see Chapter X). The
median link is supposed to represent the typical relationship


1 Data from Table 79, page 243, will be employed. Items covering a period of at least 
several years are necessary if seasonal variation is to be reliably determined. 
between the month and the preceding month. The median
link relatives of pig-iron production are given at the bottom of 
Table 97. 
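
The second and third steps may be sketched briefly. The link relatives below are those of Table 96 for September, October, and November, 1904-1914; the dictionary layout and function name are assumptions. Sorting the values of a month takes the place of the tally in the frequency table.

```python
from statistics import median

# Link relatives for three months, 1904-1914, from Table 96.
links_by_month = {
    "September": [116, 103, 102, 97, 104, 106, 98, 103, 98, 99, 94],
    "October":   [107, 108, 112, 107, 111, 109, 102, 106, 109, 102, 94],
    "November":  [102, 98, 100, 78, 101, 98, 91, 95, 98, 88, 85],
}


def median_links(by_month):
    """Median link relative of each month."""
    return {month: median(values) for month, values in by_month.items()}


medians = median_links(links_by_month)
print(medians)
```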


TABLE 96. LINK RELATIVES OF MONTHLY PIG-IRON PRODUCTION, 1904-1914


(Percentage of each item to item for preceding month)


            | 1904 | 1905 | 1906 | 1907 | 1908 | 1909 | 1910 | 1911 | 1912 | 1913 | 1914
January     | [?]  | [?]  | [?]  | [?]  | [?]  | [?]  | [?]  | [?]  | [?]  | [?]  | [?]
February    | 131  |  90  |  92  |  92  | 103  |  95  | [?]  | [?]  | [?]  | [?]  | [?]
March       | [?]  | [?]  | [?]  | [?]  | [?]  | 107  | [?]  | 122  | [?]  | [?]  | [?]
April       | 107  |  99  |  96  | 100  |  94  |  95  |  95  |  94  | [?]  | 100  |  97
May         | [?]  | 102  | 101  | 104  | 101  | 108  |  96  |  92  | 106  | 103  |  92
June        | [?]  | [?]  | [?]  | [?]  | [?]  | [?]  | [?]  | [?]  | [?]  | [?]  | [?]
July        | [?]  |  97  | 102  | 101  | 112  | 109  |  95  | 100  |  99  |  97  | 102
August      | 106  | 106  |  96  | 100  | 112  | 107  |  98  | 107  | 104  | 100  | 102
September   | 116  | 103  | 102  |  97  | 104  | 106  |  98  | 103  |  98  |  99  |  94
October     | 107  | 108  | 112  | 107  | 111  | 109  | 102  | 106  | 109  | 102  |  94
November    | 102  |  98  | 100  |  78  | 101  |  98  |  91  |  95  |  98  |  88  |  85
December    | 109  | 102  | 102  | [?]  | [?]  | [?]  | [?]  | [?]  | 100  |  89  | 100

Source: Review of Economic Statistics, prel. vol. I, p. 67. 


The next step consists in chaining together the successive 
median link relatives. For this purpose January may be assigned 
any arbitrary value, say, 100. The item for February is then 
obtained by applying the median link relative for February to 
this assumed value for January; the item for March by applying 
the March link relative to the chain relative for February; and 
so on through the several months of the year. In this way the 
link relatives are welded together into a chain running around 
the year from January to January. The results of this process 
appear in column (b) of Table 100 (p. 296). 
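The chaining just described may be sketched in a few lines of Python. This is only an illustration: the function name and the list layout (February link first, January link last) are assumptions of the sketch, and the median links are those read from Table 100.

```python
def chain_links(median_links, start=100.0):
    """Chain twelve median link relatives (per cent) around the year.

    median_links[0] is the February link (February to January), and
    median_links[11] is the January link (January to December).
    Returns 13 chain relatives: the opening January, February through
    December, and the closing January.
    """
    chain = [start]
    for link in median_links:
        # Each item is the preceding chain relative times the month's link.
        chain.append(chain[-1] * link / 100.0)
    return chain


# Median links for pig iron as read from Table 100 (February first).
PIG_IRON_LINKS = [95, 114, 97, 101, 94, 100, 104, 102, 107, 98, 102, 99.5]

chain = chain_links(PIG_IRON_LINKS)
for value in chain:
    print(round(value, 1))
```

The closing January comes out near 112.6 rather than 100. The table rounds each chain relative to one decimal before going on, so its intermediate figures may differ slightly from those of the sketch.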

If, however, the number of link relatives is small, or if the 
median link does not appear to be thoroughly representative, it 
may be desirable in the interests of stability to take an arithmetic 
mean of a group of middle link relatives (say, the middle three or 
five if the total number is odd, or the middle two or four if the 
total number is even) rather than the median link.1
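A minimal sketch of this alternative follows. The helper name `middle_mean` is hypothetical, and the September link relatives are those read from Table 96.

```python
def middle_mean(values, k):
    """Arithmetic mean of the k middle items of the ordered values.

    k must have the same parity as the number of items, so that the group
    is centered: the middle 3 or 5 of an odd count, the middle 2 or 4 of
    an even count. With k = 1 and an odd count the result is the median.
    """
    n = len(values)
    if k < 1 or k > n or (n - k) % 2:
        raise ValueError("k must lie between 1 and n and match the parity of n")
    ordered = sorted(values)
    lo = (n - k) // 2
    return sum(ordered[lo:lo + k]) / k


# September link relatives, 1904-1914, as read from Table 96.
september = [116, 103, 102, 97, 104, 106, 98, 103, 98, 99, 94]

print(middle_mean(september, 1))  # the median link
print(middle_mean(september, 3))  # mean of the middle three
print(middle_mean(september, 5))  # mean of the middle five
```

The extreme item of 116 has no influence on any of the three results, which is the stability the text has in view.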


1 See W. L. Crum, “The Use of the Median in Determining Seasonal Variation,”
Journal of the American Statistical Association, March, 1923.

TABLE 97. FREQUENCY DISTRIBUTIONS OF LINK RELATIVES OF MONTHLY
PIG-IRON PRODUCTION, 1904-1914 *


RELATIVES   | JAN. | FEB. | MAR. | APR. | MAY | JUNE | JULY | AUG. | SEPT. | OCT. | NOV. | DEC.

(Tally marks by month against link relatives from over 125 downward.)

Median link | 99½  | 95   | 114  | 97   | 101 | 94   | 100  | 104  | 102   | 107  | 98   | 102


* Here the cases are entered by mere tally marks. The year of the link relative may 
instead be recorded in the table. This permits of a somewhat more intelligent analysis of 
the distributions. 

The chaining of the link relatives almost invariably discloses a 
discrepancy: the January relative obtained by applying the 
successive link relatives to the chain relatives of the twelve 
preceding months does not equal the January with which the 
chain is begun. This discrepancy may arise from slight influences 
of trend not completely eradicated from the link relatives,1 or 
from erraticness in the median link relatives as measures of 
typical month-to-month relationships. Whatever the origin of 
the discrepancy, it is to be removed. This may be done either 
logarithmically or arithmetically, that is, either on the assump- 
tion of the compounding of a constant degree of error, or the 
successive addition of a constant increment of error. 

Correction on the assumption of the compounding of a constant 
proportion of error is best carried through in terms of logarithms.2 
The nature of the computation is shown in the several columns of 
Table 98. (See page 293.) Column (b) gives the logarithms 
of the link relatives. The link relatives are easily chained by 
the addition of each logarithm to the sum of the preceding loga- 
rithms. The correction for any discrepancy is accomplished by 
dividing the discrepancy of the logarithms by twelve and then 
subtracting this fraction from the logarithm of February, twice 
this fraction from the logarithm of March, and so on through 
the twelve months of the year. Having adjusted the loga- 
rithms in this fashion for the discrepancy of the chain relatives, 
the adjusted chain indices, with January as a base, are found 
by obtaining the numbers corresponding to the logarithms, as 
shown in column (f).3 It is necessary, as previously noted, to 
have the average of all the relatives equal 100 since it is of the 
nature of seasonal variation that it represents merely disturb- 
ances of the level of the year and not change in the average of 
the year. The seasonal relatives are to be reduced to this base 
by simply dividing through by the average of the relatives as 


1 This is Professor Persons’ suggestion. 

2 The correction of the discrepancy logarithmically is preferred by Professor Persons. 

3 An alternative method of adjusting the median link relatives for their discrepancy 
when chained is shown in Table 99. I am indebted to Professors Hettinger and Piper of 
the Harvard Graduate School of Business Administration for this form. 

first derived on the base of January equals 100. In the illus- 
trative case, this average is 91.7. Dividing the relatives shown 
in column (f) by this average gives the relatives shown in 
column (g). The twelve relatives taken together constitute 
the final index of seasonal variation. Such an index affords a 
statement of the degree to which each month may be expected, 
on account of seasonal influences, to deviate on the average from 
the mean of the year. 


TABLE 99. LOGARITHMIC CORRECTION OF MEDIAN LINK RELATIVES IN
COMPUTATION OF SEASONAL INDEX: FORM B


(Imports of Crude India Rubber into the United States, 1901-1921)


MONTHS | MEDIAN LINK RELATIVE | LOGARITHM OF LINK RELATIVE | ADJUSTMENT FACTOR | LOGARITHM OF ADJUSTED LINK RELATIVE (b)-(c) * | LOGARITHM OF CHAIN RELATIVE WITH JANUARY AS BASE (= 100) | CHAIN RELATIVE WITH JANUARY AS BASE (Av. = 90.9) | SEASONAL INDEX (Av. = 100)
       | (a)   | (b)    | (c)   | (d)   | (e)   | (f)   | (g)
       |       |        |       |       | .0000 | 100.0 | 110
F/J    | 105.5 | 2.0233 | .0059 | .0174 | .0174 | 104.1 | 114
M/F    | 115.6 | 2.0630 | .0059 | .0571 | .0745 | 118.7 | 131
A/M    |  90.6 | 1.9571 | .0059 | .9512 | .0257 | 106.1 | 117
M/A    |  85.2 | 1.9304 | .0060 | .9244 | .9501 |  89.1 |  98
J/M    |  89.9 | 1.9538 | .0059 | .9479 | .8980 |  79.1 |  87
J/J    |  99.3 | 1.9969 | .0059 | .9910 | .8890 |  77.5 |  85
A/J    |  91.4 | 1.9609 | .0059 | .9550 | .8440 |  69.8 |  77
S/A    | 115.5 | 2.0626 | .0060 | .0566 | .9006 |  79.5 |  87
O/S    | 108.1 | 2.0338 | .0059 | .0279 | .9285 |  84.8 |  93
N/O    | 105.9 | 2.0249 | .0059 | .0190 | .9475 |  88.6 |  98
D/N    | 107.2 | 2.0302 | .0059 | .0243 | .9718 |  93.7 | 103
J/D    | 108.2 | 2.0342 | .0060 | .0282 | .0000 | 100.0 |
       |       | 24.0711 | .0711
       |       | 24.0000
       |       | 12).0711   log discrepancy to be deducted
       |       |   .005925  monthly deduction


* Characteristics are disregarded after test for discrepancy because their only use is 
to determine positions of decimal points, and these are apparent by inspection. All com- 
putations are carried through, however, just as if characteristics were present. 
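The logarithmic correction may be sketched as follows, using the median links of Table 99. The function name is hypothetical, and the sketch works with logarithms of the links expressed as ratios (link divided by 100), so that the characteristics take care of themselves. It also spreads the discrepancy in exactly equal twelfths instead of the rounded .0059 and .0060 of the table, so its figures may differ from the printed ones in the last place.

```python
import math


def log_corrected_index(median_links):
    """Seasonal index from twelve median links with logarithmic correction.

    median_links[0] is the February link, median_links[11] the January link.
    Returns (chain, closing, index): the twelve corrected chain relatives on
    the base January = 100, the corrected closing January, and the index
    reduced so that its average is 100.
    """
    logs = [math.log10(link / 100.0) for link in median_links]
    discrepancy = sum(logs)      # log of (closing January / opening January)
    step = discrepancy / 12.0    # equal deduction from each month's logarithm
    chain = [100.0]
    running = 0.0
    for lg in logs:
        running += lg - step
        chain.append(100.0 * 10 ** running)
    closing = chain.pop()        # thirteenth item: January again
    mean = sum(chain) / 12.0
    index = [100.0 * c / mean for c in chain]
    return chain, closing, index


# Median links for crude rubber imports, Table 99, column (a).
RUBBER_LINKS = [105.5, 115.6, 90.6, 85.2, 89.9, 99.3,
                91.4, 115.5, 108.1, 105.9, 107.2, 108.2]

chain, closing, index = log_corrected_index(RUBBER_LINKS)
for c, s in zip(chain, index):
    print(round(c, 1), round(s))
```

After correction the closing January returns to 100, and the index is lowest in August and highest in March, in agreement with columns (f) and (g).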


Arithmetic (as distinguished from logarithmic) correction of the 
relatives has the virtue of simplicity, and appears to be logical 
as long as a straight-line trend is used in the analysis of the series. 
If the correction of the discrepancy is made arithmetically, the 
total discrepancy is divided by 12 and 1/12 taken from February, 
2/12 from March, 3/12 from April, etc., until 12/12, or the entire dis- 
crepancy, is taken from the second January item. The several 
steps of the process are shown in full in Table 100. 

TABLE 100. ARITHMETIC CORRECTION OF MEDIAN LINK RELATIVES IN
COMPUTATION OF SEASONAL INDEX
(Monthly Pig-Iron Production, 1904-1914)


ITEM        | MEDIAN LINK RELATIVE (per cent) | UNCORRECTED CHAIN RELATIVE | CORRECTION | CORRECTED CHAIN RELATIVE | SEASONAL RELATIVE (Aver. = 100)
            | (a)  | (b)   | (c)   | (d)    | (e)
January . . |  99.5 | 100.0 |  -    |        | 101
February  . |  95   |  95.0 |  1.05 |  93.95 |  95
March . . . | 114   | 108.3 |  2.10 | 106.20 | 107
April . . . |  97   | 105.1 |  3.15 | 101.95 | 103
May . . . . | 101   | 106.2 |  4.20 | 102.00 | 103
June  . . . |  94   |  99.8 |  5.25 |  94.55 |  95
July  . . . | 100   |  99.8 |  6.30 |  93.50 |  94
August  . . | 104   | 103.8 |  7.35 |  96.45 |  97
September . | 102   | 105.9 |  8.40 |  97.50 |  98
October . . | 107   | 113.3 |  9.45 | 103.85 | 104
November  . |  98   | 111.0 | 10.50 | 100.50 | 101
December  . | 102   | 113.2 | 11.55 | 101.65 | 102
January . . |  99.5 | 112.6 | 12.60 | 100.00 |
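The arithmetic correction may be sketched in the same way. The function name is hypothetical; the median links are those read from Table 100. No intermediate rounding is done here, so the figures agree with the table only to about one decimal.

```python
def arithmetic_corrected_index(median_links):
    """Seasonal index from twelve median links with arithmetic correction.

    median_links[0] is the February link, median_links[11] the January link.
    Returns (corrected, closing, index).
    """
    chain = [100.0]
    for link in median_links:
        chain.append(chain[-1] * link / 100.0)
    discrepancy = chain[-1] - chain[0]   # closing January less opening January
    step = discrepancy / 12.0
    # Take 1/12 of the discrepancy from February, 2/12 from March, and so on,
    # until the whole of it is taken from the closing January.
    corrected = [c - i * step for i, c in enumerate(chain)]
    closing = corrected.pop()
    mean = sum(corrected) / 12.0
    index = [100.0 * c / mean for c in corrected]
    return corrected, closing, index


# Median links for pig iron as read from Table 100 (February first).
PIG_IRON_LINKS = [95, 114, 97, 101, 94, 100, 104, 102, 107, 98, 102, 99.5]

corrected, closing, index = arithmetic_corrected_index(PIG_IRON_LINKS)
for c, s in zip(corrected, index):
    print(round(c, 2), round(s))
```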


A somewhat different method of measuring seasonal variation 
deals with the ratios of the monthly items to their respective 
ordinates of trend instead of with the link relatives.1 In this 
method, a frequency distribution is made of the ratios of all 
the January items, another of the ratios of the February items, 
and so on for each month of the year. Arithmetic means are 
then taken of the middle groups (3, 5, or 7 items) in each of these 
frequency distributions. The twelve crude relatives thus 
obtained are finally divided by their average so that the mean of 
the twelve seasonal relatives becomes 100. 
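A brief sketch of this ratio-to-trend method is given below. The ratios are hypothetical figures invented for the illustration (five years for each month, with one extreme item in every month), and both function names are assumptions of the sketch.

```python
def middle_mean(values, k):
    """Arithmetic mean of the k middle items of the ordered values."""
    ordered = sorted(values)
    lo = (len(ordered) - k) // 2
    return sum(ordered[lo:lo + k]) / k


def seasonal_from_ratios(ratios_by_month, k=3):
    """Seasonal relatives from ratios of monthly items to trend.

    ratios_by_month holds twelve lists, one per calendar month, each with
    that month's ratios to trend (per cent) for the several years.
    """
    crude = [middle_mean(ratios, k) for ratios in ratios_by_month]
    average = sum(crude) / len(crude)
    # Divide through by the average so that the twelve relatives average 100.
    return [100.0 * c / average for c in crude]


# Hypothetical ratios to trend, not taken from the text.
base = [104, 96, 108, 103, 102, 95, 94, 97, 98, 105, 100, 98]
ratios = [[b - 6, b - 1, b, b + 2, b + 20] for b in base]

index = seasonal_from_ratios(ratios, k=3)
print([round(s, 1) for s in index])
```

Because only the middle three items of each month enter the mean, making the extreme item larger leaves the index unchanged.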

This method has the advantage of avoiding the rather laborious 
calculation of the link relatives; the ratios of the monthly items 


1 See Helen D. Falkner, “The Measurement of Seasonal Variation,” Journal of the Amer- 
ican Statistical Association, June, 1924. 

to their ordinates of trend have to be computed in the elimination 
of trend in any case so that the data from which the seasonal 
relatives are obtained are to be had without additional effort. 
This is doubtless a strong point in favor of the method. On the 
other hand, the individual ratios of the original items to trend 
do not represent seasonal relationships as clearly as do the link 
relatives; cyclical movements, for example, may certainly affect 
the ratios materially.1 The ratios are therefore less surely 
representative of seasonal variation. Possibly the averaging of 
middle groups of ratios meets this objection satisfactorily, but 
the method needs to be tested by extensive use before it is 
allowed to supplant the method of link relatives.2

Seasonal relatives are designed to register the strength of 
typical month-to-month influences. Sometimes the relatives 
reflect factors of an entirely different sort. If the variable 
under analysis is in the form of a monthly aggregate, the typical 
month-to-month differences reflect in part differences in the 
lengths of the months; the January items in the series con- 
sidered above tend to be larger than the February items partly 
because January contains more working days than February. 
The extent to which this factor may influence the situation will 
appear if a comparison is made of seasonal relatives computed 
from the data of monthly pig-iron production and from the data 
of monthly daily average production of pig iron. Obviously, 
the former of these is directly influenced by the variable number 
of working days in the months, whereas the latter is not at all so 
affected. The seasonal relatives for the two appear in Chart 74. 

The differences here are marked and, for many purposes, of 
decided significance. Month-to-month differences arising from 
inequalities in the months are hardly true seasonal differences. 


1 In other words, the dispersion of the ratios is greater than the dispersion of the link 
relatives. 

2 The method of monthly sums (or means) described earlier in this chapter is sometimes 
used in the analysis of data affected by both trend and cyclical influences. The method does 
not yield satisfactory results, however, if these non-seasonal influences, or other markedly 
irregular movements, disturb the seasonal relationships. True, the influence of trend may 
be extracted by adjustments of the monthly sums (or means), for which see Davies, Economic 
Statistics, p. 119; but the distortion of the seasonal relatives by extreme items, whether 
due to cyclical or irregular causes, is practically ineradicable under this method. 

From many points of view they are spurious. They should be 
carefully distinguished from true seasonal differences in any 
attempt at complete analysis of seasonal variation. 
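The influence of unequal months may be seen in a small sketch. The monthly totals are hypothetical, chosen so that output runs at a constant 70 units a day, and calendar days are used in place of working days, which is an assumption made for simplicity.

```python
import calendar


def daily_average(monthly_total, year, month):
    """Monthly aggregate divided by the number of calendar days in the month."""
    days = calendar.monthrange(year, month)[1]
    return monthly_total / days


# Hypothetical totals: a constant rate of 70 units a day in 1906.
january_total = 70 * 31    # 2170
february_total = 70 * 28   # 1960

link_on_totals = 100.0 * february_total / january_total
link_on_daily = (100.0 * daily_average(february_total, 1906, 2)
                 / daily_average(january_total, 1906, 1))

print(round(link_on_totals, 1))  # below 100 solely because February is shorter
print(round(link_on_daily, 1))   # 100.0: no change in the daily rate
```

The February link relative computed from the totals falls to about 90 although the daily rate has not changed at all.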

The methods of measuring seasonal variation thus far con- 
sidered have assumed that the seasonal movement remains the 
same over the period of analysis. This assumption is not always 
warranted. The seasonal movement may differ from one year 
to another. Thus, in the bituminous coal industry the seasonal 


CHART 74. SEASONAL INDEX FOR PIG-IRON PRODUCTION 


A. Computed from monthly production data. B. Computed from daily rate data. 

irregularities of operation have been markedly different in the 
even and odd calendar years. In the even years, the biennial 
wage agreements have been negotiated, with the result that the 
April depression of production has been frequently accentuated 
through labor disturbances. Consequently the even years 
show a more pronounced seasonal than the odd years. Given 
such conditions, it is necessary to calculate separate seasonal 
indices for the even and odd years. 

A more common case of changing seasonal is that in which 
the seasonal relatives show a tendency toward steady change in 
one direction or another; in other words, the seasonal relatives 
exhibit a trend of their own. Consider a concrete illustration. 
Domestic consumption of gas has shown a strong seasonal move- 
ment in response to the varying amounts of light and darkness 
during the twenty-four hours of the day at different seasons of the 
year. The heaviest consumption has ordinarily been toward the 
close of December, lowest consumption at the end of June. 
This seasonal movement, however, is related to the consumption 
of gas for lighting purposes. The increasing use of the gas stove 
promises to change the situation. There is a premium of con- 
venience in the use of gas stoves in hottest weather. One of the 
large municipal companies has reported that in its experience 
the maximum consumption of gas has been during a couple 
of hours on a Sunday forenoon in summer.1 The seasonal 
movements of gas consumption for lighting and gas consump- 
tion for cooking differ radically, with the result that the sea- 
sonals for gas consumption as a whole will be substantially 
modified by the increasing employment of gas ranges. Another 
interesting example of changing seasons is to be observed in the 
production of automobiles. Here, growing popularity of the 
closed car and increasingly generous credit accommodations for 
car buyers have decidedly modified the seasonal movement of 
car production. Other cases might be cited in which seasonal 
variation has undergone decided change. 

When seasonal variation is exhibiting a trend of its own, the 
seasonal relatives may be adjusted for trend along the lines in- 
dicated in the preceding chapter. Let relatives for the individual 
months be calculated as before.2 Instead of throwing these into 
the form of a multiple frequency table, however, let the relatives 
of each month be plotted as a time series and analyzed for trend. 
The ordinates of the trend of the seasonals may then be sub- 
stituted for the median relatives in obtaining the seasonal index 
for any given year.3 In this way a set of seasonal relatives 


1 See Frank Popewell, “Seasonal Fluctuations in the Gas Industry,” Statistical Journal, 
June, 1911, p. 721. 

2 The relatives may be either in “link” form, or in the form of monthly means, that is, 
relatives of the monthly item to the annual monthly average. 

3 The balance of the method may be left as before. The method, though serviceable, 
is open to the objection that extreme relatives may disturb the seasonal trend. 

corrected for the trend of the seasonals may be obtained for each year 
over any period in which the seasonal relationships are under- 
going modification.1
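The treatment of a trend in the seasonals may be sketched for a single month. The December relatives below are hypothetical, the trend is assumed to be a straight line fitted by least squares, and the function names are inventions of the sketch.

```python
def fit_line(years, values):
    """Least-squares straight line; returns (intercept, slope)."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def seasonal_trend_ordinate(years, relatives, year):
    """Ordinate of the trend of one month's relatives for a given year."""
    intercept, slope = fit_line(years, relatives)
    return intercept + slope * year


# Hypothetical December relatives drifting downward, not taken from the text.
years = [1920, 1921, 1922, 1923, 1924]
december = [121, 117, 116, 115, 111]

print(round(seasonal_trend_ordinate(years, december, 1922), 1))
print(round(seasonal_trend_ordinate(years, december, 1924), 1))
```

The same fitting would be repeated for each of the twelve months, and the twelve ordinates for a given year then take the place of the median relatives in the remaining steps.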

The graphic representation of seasonal variation presents no 
new problems. The seasonal relatives are simply plotted as a 


CHART 75. PRODUCTION OF WISCONSIN CHEDDAR CHEESE, 1911 


(Based on average taken from 16 factories) 
Source: University of Wisconsin Agricultural Experiment Station, Bulletin 231, April, 
1913, p. 21. 


time series. Both vertical bar and line forms are satisfactory. 
The illustration given in Chart 75 serves to make the matter clear. 

Seasonal variation having been accurately measured may be 
readily eliminated from the series.2 This may be done by 
dividing each original item by the seasonal relative of the month 


1 On this subject see W. L. Crum, “Progressive Variation in Seasonality,” Journal of 
the American Statistical Association, March, 1925, pp. 48-64. 

2 If the seasonal variation is slight, say, not more than 2% on either side of the 
monthly mean, it may be safely ignored. 

of the year for which the item has been reported. On the other 
hand, if the ratios of the original items to the ordinates of trend 
have already been computed in connection with the elimination 
of trend, the seasonal movement may be removed by dividing 
each of these ratios in turn by the seasonal relative of the month 
of the year for which the ratio is given. An even simpler method 
is available if the seasonal relatives fall within certain limits — 
say, 95 and 105, possibly 90 and 110. In this case, the seasonal 
relative may be subtracted from the ratio of the actual item to its 
ordinate of trend. Any one of these methods serves to take the 
seasonal variation out of the record of the variable. 
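The two ways of removing the seasonal movement from a ratio to trend may be compared in a short sketch. The function names are hypothetical; the figures are those for January, 1906, as read from Table 101 (ratio to trend 115, seasonal relative 101).

```python
def remove_seasonal_by_division(ratio_to_trend, seasonal):
    """Ratio to trend divided by the seasonal relative, in per cent."""
    return 100.0 * ratio_to_trend / seasonal


def remove_seasonal_by_subtraction(ratio_to_trend, seasonal):
    """Simpler form: the result is a deviation from the zero line."""
    return ratio_to_trend - seasonal


divided = remove_seasonal_by_division(115, 101)
subtracted = remove_seasonal_by_subtraction(115, 101)

print(round(divided - 100.0, 1))  # deviation from 100 by division
print(subtracted)                 # deviation by subtraction
```

With a seasonal relative as close to 100 as this, the two deviations agree to within a fraction of a point; the further the seasonal relative lies from 100, the wider the gap between them becomes.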

In dealing with any index of seasonal variation, however 
carefully derived, it should be remembered that individual 
seasonal relatives are essentially averages. It follows that 
they display the defects as well as the virtues of averages. 
A seasonal relative for February may fail completely to 
represent accurately the actual seasonal variation of a partic- 
ular February. A larger (or smaller) number of working days 
than usual may introduce an exceptional influence; or weather 
conditions may exert a depressing influence, seasonal in charac- 
ter, but more than ordinarily severe. Averages never represent 
extreme items; and seasonal relatives, being averages, fail to 
register extreme conditions. This fact should always be kept 
in mind in interpreting the results obtained in the standard 
analysis of seasonal influences. 


CHAPTER XIX 


RESIDUAL MOVEMENTS: CYCLICAL AND IRREGULAR 
FLUCTUATIONS 


In the two preceding chapters, methods of measuring trend and 
seasonal variation have been described. Once these two varieties 
of movement have been measured, it is comparatively simple to 
extract both from the series. The residuum, in the case of most 
economic and business series, is a composite of cyclical and 
irregular fluctuations. It is the purpose of the present chapter 
to analyze these residual elements. 

The general characteristics of the residuum left after taking 
trend and seasonal variation from a time series may be most easily 
seen in a concrete case. Consider the data given in Chart 76.1 

The general character of the deviation items is fairly evident. 
In the first place, the series in Chart 76 is marked by a definite 
wave-like movement. There are crests in 1906-7 and late 1909, 
and an equally definite trough in 1908. In the second place, 
there are occasional sharp, jagged breaks in the series; the most 
marked of these are to be observed in the summer of 1906 and 
spring of 1909. Finally, there are a large number of minor fluctu- 
ations appearing as slight irregularities in the general course of the 
variable. These three varieties of movement may best be dealt 
with separately. It is most satisfactory to consider first the 
more pronounced irregular fluctuations which involve sharp 
breaks in the series. As previously suggested, these may be 
referred to as episodic movements. 


A. EPISODIC MOVEMENTS 
Episodic movements fall into the general category of irregular 
fluctuations, but they differ from other irregular fluctuations 


1 For the data in full, see column (e) of Table 101, page 309. 

in that in each case the immediate cause of the movement is 
specific, and ordinarily known. Take, for example, the sharp 
break in the fall of 1919 in the production of pig iron (see Chart 
78). This sudden decline in relative output was occasioned by 
a serious strike in the steel industry. The strike began in late 


CHART 76. DEVIATIONS OF MONTHLY PIG-IRON PRODUCTION ABOVE AND 
BELOW TREND ADJUSTED FOR SEASONAL VARIATION, 1906-1909 

NOTE: The zero line of a chart of this sort is sometimes called the “normal line,” the 
idea being that the line indicates the middle position above and below which the oscillations 
of the business cycle take place. The zero line represents, in other words, the condition 
which would prevail if no cyclical or irregular influences disturbed the variable. According 
to the Harvard Economic Service, “‘normal’ conditions are those which would prevail if the 
business cycle were eliminated and all sporadic and irregular factors — such as strikes — 
entirely avoided. Even under these conditions there would be room for substantial changes 
ee elem ay a over the years, for long-time tendencies of growth or decadence, and short- 
time influences of a seasonal character would still be operative. ‘Normal’ is a composite of 
trend and seasonal variation. The net effect of the business cycle is to carry conditions 
in any line of industry now above, now below, normal. In brief, ‘normal’ is neither the 
animation of high prosperity and the subsequent boom, nor the lethargy of deep depression 
and the first stages of recovery. Rather it is a middle condition above and below which the 
fluctuations of the business cycle take place.” Review of Economic Statistics, vol. V, p. 32. 

September and resulted directly in a sharp decline of steel out- 
put and consequently of pig-iron production. Later a strike in 
the soft coal field was a contributing factor. The effects of the 
two strikes in conjunction are seen in the break in pig-iron 
production. 
Another illustration of the same kind of fluctuation is given 
in Chart 77: 

Source: Statistical Bulletin of the Company, vol. I, no. 1. 


The sharp peak shown in October-December by the curve for 
1918 reflects a definite episode — the influenza epidemic of this 
period. The movement registers the force of this factor, and 
practically this factor alone. 

Methods of measuring irregular fluctuations of this sort have 
to be, from the nature of the case, more or less arbitrary. Each 
case is essentially unique. Approximate measurements never- 
theless are possible by relatively simple means. The primary 
requisite is fairly accurate knowledge of what would probably 
have been the course of the variable had not the disturbing 
episodes occurred. This knowledge may be obtained ordinarily 
from careful study of the secular trend, seasonal variation, and 
cyclical fluctuations of the series.1 Since the method involves 
at best a mere approximation, a free-hand smoothing of the 
curve will usually serve to register the course the variable would 


CHART 78. EFFECTS OF STRIKES OF 1919 ON MONTHLY PRODUCTION 
OF PIG IRON, 1919-1920 

have followed had it not been disturbed by the specific factor, or 
factors, in question. 

The procedure may be illustrated by further examination of the 
case of pig-iron production in 1919. Consider the curves shown 
in Chart 78. The course the variable would probably have taken 


1 If any of these movements are absent in the series, the problem is that much simpler. 

but for the steel and coal strikes is indicated in the dotted line. 
It is reasonable to suppose that pig-iron production would not 
have risen quite as high as it did in August, 1919, if the steel 
strike had not been in the offing. Furthermore, it probably 
would not have reached quite as large a figure as it did in the 
winter of 1919-20 if pressure on the mills had not accumulated 
as a result of the strike. The underlying tendency of the period 
was undoubtedly, as a result of cyclical influences,1 in an upward 
direction. The smoothed curve gives a reasonable picture of pig- 
iron production as it would have been if the strikes had not 
occurred. 

The differences between the smooth curve and the plot of actual 
production measure the extent of the disturbance. This is 
represented in the chart by that portion of the curve which has 
been cross-hatched. The measurement is admittedly some- 
what arbitrary and at best only approximate. It gives, however, 
a reasonably clear and dependable picture of the form and magni- 
tude of the episodic movement. 


B. CYCLICAL FLUCTUATIONS 


Cyclical fluctuations involve a succession of ups and downs 
which give the movement as a whole a definite wave-like char- 
acter. In this respect, cyclical fluctuations resemble the periodic 
movements considered in the preceding chapter. But cyclical 
fluctuations do not show approximate repetitions of a single 
shorter cycle; they do not exhibit movements of any known, 
fixed periodicity.2 The lengths of the cycles, as well as their 
amplitudes, vary considerably. Cyclical fluctuations conse- 
quently do not lend themselves to the analysis outlined for 

1 No allowances need be made for secular and seasonal factors since the relatives have 
been adjusted for these. 

2 A simple test of periodicity may be mentioned at this point. If the first item in a time 
series is successively correlated with every second, third, fourth, etc., the resulting correla- 
tion coefficients will reveal any definite periodicity in the items; for it is clear that the 
coefficient will reach a maximum when the items correlated tend to fall more or less con- 
sistently in a single phase of the cycle. (See P. G. Wright, “Review of H. L. Moore’s 
Economic Cycles,” Quarterly Journal of Economics, May, 1915, pp. 634-635.) While the results 
given by this relatively simple method are by no means conclusive, they raise definite 
presumptions concerning the presence or absence of periodicity in the series. 

seasonal variation and other periodic movements; they require 
treatment by other methods. 
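The simple test of periodicity mentioned in the note may be sketched as follows. It is read here as correlating the series with itself at successive lags, which is an interpretation of the note and not a quotation of it; the series is an artificial one with a known period of twelve months, and the function names are hypothetical.

```python
import math


def correlation(xs, ys):
    """Pearson coefficient of correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def lag_correlations(series, max_lag):
    """Correlate the series with itself shifted 1, 2, ..., max_lag items."""
    return {k: correlation(series[:-k], series[k:])
            for k in range(1, max_lag + 1)}


# Artificial series with a fixed period of 12 items, not taken from the text.
series = [math.sin(2 * math.pi * t / 12) for t in range(60)]

r = lag_correlations(series, 18)
best_lag = max(r, key=r.get)
print(best_lag, round(r[best_lag], 3))
```

For a strictly periodic series the coefficient reaches its maximum at the lag equal to the period. For cyclical data of varying length no single lag would be expected to stand out so sharply.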

The simplest means of obtaining a clear image of cyclical 
fluctuations is to take a centered moving-average of the devia- 
tions from trend after these have been adjusted for seasonal varia- 
tion.1 The spread of the moving-average may be made almost 
any fixed number of months, say, 3, 5, 7, or 9. In general, the 
practice should be to use the shortest wave-length that will 
smooth out the minor irregularities of the curve.2 A 5-months 


CHART 79. FIVE-MONTHS MOVING-AVERAGE (centered) OF CYCLICAL 
FLUCTUATIONS OF MONTHLY PIG-IRON PRODUCTION, 1905-1915 

average will commonly be found satisfactory. An illustration 
of a moving-average used to portray cyclical fluctuations is given 
in Chart 79. 
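A centered moving-average of an odd number of months may be sketched as follows. The function name is hypothetical; the items are the twelve deviations for 1906 as read from column (e) of Table 101.

```python
def centered_moving_average(values, span=5):
    """Centered moving-average of an odd span.

    Months too near either end of the series to have a full window are
    returned as None.
    """
    if span < 1 or span % 2 == 0:
        raise ValueError("span must be an odd positive number of months")
    half = span // 2
    smoothed = [None] * len(values)
    for i in range(half, len(values) - half):
        window = values[i - half:i + half + 1]
        smoothed[i] = sum(window) / span
    return smoothed


# Deviations for 1906 as read from Table 101, column (e).
deviations_1906 = [14, 10, 12, 11, 12, 13, 16, 8, 8, 15, 17, 18]

smoothed = centered_moving_average(deviations_1906, span=5)
print(smoothed)
```

Two months are lost at each end with a spread of five, which is one reason for preferring the shortest spread that gives a satisfactory curve.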

The cycle curve here shown serves to bring out clearly the 
important fact that both the duration and the amplitude of 
cyclical fluctuations vary considerably. The average duration 
of the business cycle in the United States appears to be about 3½ 
years, or 42 months, but cycles as short as 28 months and as 
long as 48 have been experienced. In general, it is to be con- 
cluded that cyclical fluctuations have no period which is as yet 

1 Additional “corrections” for episodic movements may be desirable. 


2 It is to be noted, however, that the use of a 12-months average gives added assurance of 
the complete elimination of seasonal variation. 

clearly assignable, and that, consequently, simple methods of 
analysis which presuppose a fixed periodicity are not applicable 
to this type of fluctuation.1

The amplitude of cyclical fluctuations is another variable 
feature. The range of fluctuation not only differs from one cycle 
to another in a single series; it differs also, and even more con- 
spicuously, from one series to another. Consider, for example, 
CHART 80. COMPARISON OF CYCLICAL FLUCTUATIONS OF PRICES OF COMMODITIES 

(Bradstreet’s Index) AND OF TWELVE INDUSTRIAL STOCKS (Wall Street 
Source: Review of Economic Statistics, prel. vol. I, pp. 71, 168. 


the cyclical fluctuations shown in the two curves of Chart 80. 
The fluctuations of wholesale prices obviously occupy a much 
narrower range than those of security quotations. 

In certain lines of analysis it is important to know the average 
amplitude of fluctuation of the cycle movements of different 
series. This feature of the series is best expressed in their respec- 
tive standard deviations. The calculation of σ in this case 
follows familiar lines. An illustrative case is given in Table 101. 


1 Whether there is periodicity in the cycle is a controverted point. Elaborate mathe- 
matical methods (notably the harmonic analysis) have been employed in the attempt to 
demonstrate periodicity. Thus far the results obtained appear to be inconclusive. In any 
case, these more elaborate methods lie beyond the scope of the present treatise. 

TABLE 101. CALCULATION OF STANDARD DEVIATION OF THE MONTHLY 
ITEMS OF A TIME SERIES 


(Pig-Iron Production in Thousands of Long Tons, 1906-1909) 


MONTH          | ORIGINAL ITEM | ORDINATE OF TREND (1904-14) | RELATIVE (a) TO (b) | SEASONAL RELATIVE * | DEVIATION † | DEVIATION SQUARED | DEVIATION IN UNITS OF σ
               | (a)  | (b)  | (c) | (d) | (e)  | (f)  | (g)
1906 January   | 2068 | 1799 | 115 | 101 |  14  |  196 |   .6
     February  | 1904 | 1804 | 105 |  95 |  10  |  100 |   .4
     March     |  …   | 1810 | 119 | 107 |  12  |  144 |   .5
     April     | 2073 | 1815 | 114 | 103 |  11  |  121 |   .5
     May       | 2098 | 1821 | 115 | 103 |  12  |  144 |   .5
     June      | 1976 | 1826 | 108 |  95 |  13  |  169 |   .6
     July      | 2013 | 1832 | 110 |  94 |  16  |  256 |   .7
     August    | 1926 | 1837 | 105 |  97 |   8  |   64 |   .4
     September | 1960 | 1842 | 106 |  98 |   8  |   64 |   .4
     October   | 2196 | 1848 | 119 | 104 |  15  |  225 |   .7
     November  | 2187 | 1853 | 118 | 101 |  17  |  289 |   .7
     December  | 2235 | 1859 | 120 | 102 |  18  |  324 |   .8
1907 January   | 2205 | 1864 | 118 | 101 |  17  |  289 |   .7
     February  | 2045 | 1870 | 109 |  95 |  14  |  196 |   .6
     March     | 2226 | 1875 | 119 | 107 |  12  |  144 |   .5
     April     | 2216 | 1881 | 118 | 103 |  15  |  225 |   .7
     May       | 2295 | 1886 | 122 | 103 |  19  |  361 |   .8
     June      | 2234 | 1892 | 118 |  95 |  23  |  529 |  1.0
     July      | 2255 | 1897 | 119 |  94 |  25  |  625 |  1.1
     August    | 2250 | 1902 | 118 |  97 |  21  |  441 |   .9
     September | 2183 | 1908 | 115 |  98 |  17  |  289 |   .7
     October   | 2336 | 1913 | 122 | 104 |  18  |  324 |   .8
     November  | 1828 | 1919 |  95 | 101 |  -6  |   36 |  -.3
     December  | 1234 | 1924 |  64 | 102 | -38  | 1444 | -1.7
1908 January   | 1045 | 1930 |  54 | 101 | -47  | 2209 | -2.1
     February  | 1077 | 1935 |  56 |  95 | -39  | 1521 | -1.7
     March     | 1228 | 1941 |  63 | 107 | -44  | 1936 | -1.9
     April     | 1149 | 1946 |  59 | 103 | -44  | 1936 | -1.9
     May       | 1165 | 1951 |  60 | 103 | -43  | 1849 | -1.9
     June      | 1092 | 1957 |  56 |  95 | -39  | 1521 | -1.7
     July      | 1218 | 1962 |  62 |  94 | -32  | 1024 | -1.4
     August    | 1348 | 1968 |  69 |  97 | -28  |  784 | -1.2
     September | 1418 | 1973 |  72 |  98 | -26  |  676 | -1.1
     October   | 1563 | 1979 |  79 | 104 | -25  |  625 | -1.1
     November  | 1577 | 1984 |  79 | 101 | -22  |  484 | -1.0
     December  | 1740 | 1990 |  87 | 102 | -15  |  225 |  -.7
1909 January   | 1801 | 1995 |  90 | 101 | -11  |  121 |  -.5
     February  | 1703 | 2001 |  85 |  95 | -10  |  100 |  -.4
     March     | 1832 | 2006 |  91 | 107 | -16  |  256 |  -.7
     April     | 1738 | 2011 |  86 | 103 | -17  |  289 |  -.7
     May       | 1880 | 2017 |  93 | 103 | -10  |  100 |  -.4
     June      | 1929 | 2023 |  95 |  95 |   0  |    0 |   .0
     July      | 2101 | 2028 | 104 |  94 |  10  |  100 |   .4
     August    | 2246 | 2033 | 111 |  97 |  14  |  196 |   .6
     September | 2385 | 2039 | 117 |  98 |  19  |  361 |   .8
     October   | 2600 | 2044 | 127 | 104 |  23  |  529 |  1.0
     November  | 2547 | 2050 | 124 | 101 |  23  |  529 |  1.0
     December  | 2635 | 2055 | 128 | 102 |  26  |  676 |  1.1
               |      |      |     |     |      | 25046

The sum of the deviations squared (Σx²) = 25,046. The number of terms (N) = 48. 


$$\sigma = \sqrt{\frac{\Sigma x^{2}}{N}} = \sqrt{\frac{25{,}046}{48}} = \sqrt{521.8} = 22.8$$


* See Table 100. 
† The deviation items have been obtained by first computing the ratio of each monthly item to 
its ordinate of trend and then subtracting from this result the seasonal relative of the month in ques- 
tion. It is convenient to refer to items of this sort as “adjusted relatives.” 

In column (g) of Table 101 the deviations of the monthly items 
are given in units of their own standard deviation. Items in 
this form have been called by Professor Persons “cycle” figures. 
They entirely eliminate from the record of the fluctuations of any 
number of series those differences due to varying amplitudes 
of cyclical movement. Where amplitude of fluctuation is not a 
consideration, therefore, attention may well be focused on the 
so-called ‘‘cycle”’ series. 
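
The arithmetic behind a “cycle” figure may be sketched in a few lines of Python. The function names are illustrative only, and the figures are those read from the February, 1907, and January, 1908, rows of Table 101 (seasonal relatives 95 and 101 are taken as read from the damaged print and should be treated as assumptions).

```python
import math

def adjusted_relative(item, trend, seasonal):
    """Ratio of the item to its ordinate of trend (per cent, rounded),
    less the seasonal relative of the month."""
    ratio_to_trend = round(100 * item / trend)
    return ratio_to_trend - seasonal

def cycle_figure(deviation, sigma):
    """Deviation expressed in units of the series' own standard deviation."""
    return round(deviation / sigma, 1)

# Standard deviation from the totals given with Table 101.
SUM_OF_SQUARES = 25046
N = 48
sigma = math.sqrt(SUM_OF_SQUARES / N)

feb_1907 = adjusted_relative(2045, 1870, 95)
jan_1908 = adjusted_relative(1045, 1930, 101)

print(round(sigma, 1))
print(feb_1907, cycle_figure(feb_1907, sigma))
print(jan_1908, cycle_figure(jan_1908, sigma))
```

Because every series treated in this way is expressed in units of its own standard deviation, two “cycle” series may be compared without regard to their original units or amplitudes.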

Measurement of the duration and amplitude of the cyclical 
fluctuations completes their analysis, save as comparisons and 
correlations of two or more are undertaken. In this case, 
additional steps are in order. These will be considered in the 
next chapter. 


C. ACCIDENTAL FLUCTUATIONS 


As previously noted, practically all time series exhibit minor
irregularities having their origin in unknown causes. These do
not ordinarily entail any serious disturbance of the variable and
do not require separate treatment. If necessary, they may be
eliminated from the record, usually by some simple device such as
a moving average.


TABLE 102. IRREGULAR FLUCTUATIONS OF VALUES OF BUILDING PERMITS
ISSUED FOR TWENTY LEADING CITIES, JULY, 1903-JUNE, 1916 *


IRREGULAR FLUCTUATIONS          FREQUENCIES
     (Percentages)             (Yearly Total)
        + 46.0                        1
      [illegible]                     1
        + 25.9                        1
        + 21.9                        4
        + 17.9                        8
        + 13.9                       10
        +  9.9                       15
        +  5.9                       23
        +  1.9                       23
        -  2.1                       19
        -  6.1                       19
        - 10.1                       15
        - 14.1                        9
        - 18.1                        4
        - 22.1                        3
        - 26.1                        1
         Total                      156


* Details of the derivation of these residuals are given in the Review of Economic Statistics,
prel. vol. I, pp. 137-138.

CHART 81. RECTANGULAR DIAGRAM OF THE IRREGULAR FLUCTUATIONS OF
MONTHLY VALUES OF BUILDING PERMITS ISSUED FOR TWENTY LEADING CITIES
AND THE NORMAL CURVE OF DISTRIBUTION FITTED THERETO *

[Chart not reproduced. Horizontal axis: x = irregular fluctuations in
units of 4%, running from -7 to +11. Vertical axis: y = frequency.
Fitted curve: $y = 21.7\,e^{-x^2/16.48}$, the equation of the normal
curve based on 156 items with Sheppard’s correction of the standard
deviation.]

* Irregular fluctuations secured by correcting original items for secular trend, cyclical fluctuations,
and seasonal fluctuations.


It may be worth while, nevertheless, to note the underlying 
character of fluctuations of this sort. In fact, these accidental 
fluctuations, when their relationships in time are ignored, conform 
to the curve of error. A concrete case will serve to illustrate. 

In Table 102 and Chart 81 are shown the irregular residual 
fluctuations of the monthly values of building permits issued for 
twenty leading cities as computed by Professor Persons for the 
period July, 1903, to June, 1916. These irregular fluctuations
represent the residuum after the elimination of secular, seasonal,
cyclical, and episodic movements. It is clear enough that the 
fluctuations distribute according to the normal law of error. It 
is thus evident that they have their origin in small influences of 
an accidental character, operating on the more important factors 
which give the time series its chief features. 


We have now completed the analysis of the several types of 
movement which appear in time series. These have been 
examined both in isolation and in combination. Now that the 
nature of the different movements is thoroughly understood, it 
is desirable to consider the means by which these movements in 
different series are to be compared and the extent of their re- 
semblance measured. This subject will be taken up in the next 
chapter. 


CHAPTER XX


CORRESPONDENCE AND CORRELATION IN 
TIME SERIES; LAG 


The full import of variations in time is rarely evident until
two or more series have been brought together and their fluctua- 
tions carefully compared. Analysis along this line may endeavor 
to answer two distinct questions: (1) to what extent do the 
movements of the two or more series correspond? (2) in what 
time relationship do the two or more series appear to be in 
closest correspondence? These questions will be considered in 
the two sections of the present chapter. 


A. THE EXTENT OF CORRESPONDENCE OR CORRELATION 


The simplest way in which to examine the extent of correspond- 
ence between two time series is to plot the two in their original 
forms on the same grid. The method is rudimentary but never- 
theless serviceable. 

If the two variables to be compared are given in the same unit 
(or in units which by simple conversion can be made the same), 
the direct comparison by graphic means is perfectly
straightforward and unambiguous. But commonly the two series to be
compared are not expressed in the same unit; then the serious
problem of choosing appropriate scales is encountered. Almost
any image may be obtained by manipulating the vertical scales.
The dangers of downright misrepresentation are serious. The
chances of misrepresentation, however, are greatly reduced if the
zeros of all vertical scales are located on the horizontal axis.¹
Subject to this one condition the vertical scales may be adjusted
in various ways: (1) so as to bring the later parts of the curves
into close proximity; (2) so as to bring the earlier parts together;
(3) so as to place the averages of the two variables on the same
level. In general the scales are to be so selected as to aid
comparison where the relationships of the variables are most
significant. Chart 82 affords an illustration.

1 See Bowley, Elements of Statistics, pp. 149 et seq.; Elementary Manual of Statistics,
chap. V.

CHART 82. COMPARISON OF YEARLY PRODUCTION OF HAY AND OATS
IN THE UNITED STATES, 1900-1915

Direct graphic comparison of two or more time series may be
simplified by reducing the items to relative form. Suppose two
series in different units are to be compared over the period
1900-1924. Let the 1900 item of each series be 100; then express all
later values of the variable as relatives of this 1900 base item.
The items in this relative form are in the same unit and directly
comparable. The problem of divergent scales is thus avoided.
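
The reduction to relative form may be sketched as follows. The two short series are hypothetical figures invented for the illustration (they are not the hay and oats data of Chart 82), and the function name is likewise an assumption.

```python
def to_relatives(series, base_index=0):
    """Express every item as a percentage of the base item (base = 100)."""
    base = series[base_index]
    return [100 * value / base for value in series]

# Hypothetical series in unlike units (say, tons and bushels).
series_in_tons = [50, 60, 75]
series_in_bushels = [800, 880, 1000]

relatives_tons = to_relatives(series_in_tons)
relatives_bushels = to_relatives(series_in_bushels)

print(relatives_tons)
print(relatives_bushels)
```

Both sets of relatives start at 100 and may be plotted against a single vertical scale.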

Neither of these two ways of comparing series is entirely
satisfactory, however, when either or both of the series is a
complex of different varieties of movement. If this is the case, the
several varieties of movement are best separated before graphic
comparison is undertaken. One trend may be contrasted with
another, one seasonal index with another, one series of cyclical
fluctuations with another, all without serious difficulty. In the
comparison of seasonals and cycles, identical scales are always
available; while in the comparison of trends, satisfactory results
are obtained if any divergence of units is dealt with in one of the
three ways suggested above.¹

Though graphic comparison may throw no little light on the 
extent to which time series move together (or in opposition), it 
rarely gives an adequate indication of correspondence or corre- 
lation. Mathematical indices affording definite quantitative 
measurements are usually desirable. A number of mathematical 
devices have been used for this purpose. Only two or three call 
for close examination. 

A measure of correspondence may be derived, in the first place, 
from the directions of change in the successive items of the two 
series under examination. Thus, let every increase in either 
series be represented by + 1 and every decrease by — 1. Then, 
if these plus and minus items for one series are multiplied by the
corresponding items of the other, +1 is obtained for every case
in which the movements are in the same direction and -1 for
every case in which they are in opposite directions.² If the
algebraic sum of these items is then obtained and this total finally
divided by the number of pairs of items included, the result is an 
index of the extent to which the two series move in sympathy. 
Table 103 shows an example. 

1 In the consideration of time series exhibiting trends in combination with other varieties
of movement, examine Chart 59 (page 244).
2 For cases in which one or the other of the variables does not change a zero is recorded.

Like the correlation coefficient, this crude measure of
correspondence takes the limiting value of +1 when the two series
invariably move together, whatever their direction, and -1
when they invariably move in opposite directions. Conditions
between these extremes will yield measures of correspondence
between +1 and -1, zero representing a relationship in which
the two series move in opposition the same number of times
that they move in sympathy. For a quick test of correspondence
in time series the measure is serviceable.
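
The crude measure may be sketched in Python as follows. The function names are illustrative; the two series are the yearly items of Table 103 as far as they can be read (several digits are restored from damaged print and should be treated as assumptions). A zero is recorded where an item does not change, in accordance with the footnote above.

```python
def sign(change):
    """+1 for an increase, -1 for a decrease, 0 for no change."""
    return (change > 0) - (change < 0)

def crude_correspondence(x, y):
    """Mean product of the signs of successive changes in two paired series."""
    signs_x = [sign(later - earlier) for earlier, later in zip(x, x[1:])]
    signs_y = [sign(later - earlier) for earlier, later in zip(y, y[1:])]
    products = [a * b for a, b in zip(signs_x, signs_y)]
    return sum(products) / len(products)

# Yearly items, 1903-1916, as read from Table 103.
series_a = [18.01, 16.50, 22.99, 25.31, 25.78, 15.94, 25.80,
            27.30, 23.65, 29.73, 30.97, 23.33, 29.92, 39.44]
series_b = [7.94, 7.92, 8.10, 8.42, 8.90, 8.01, 8.52,
            8.99, 8.71, 9.19, 9.21, 8.90, 9.85, 11.82]

measure = crude_correspondence(series_a, series_b)
print(measure)
```

Fourteen yearly items give thirteen pairs of changes, and the sum of the sign products is divided by thirteen.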


TABLE 103. CALCULATION OF CRUDE MEASURE OF CORRESPONDENCE
BETWEEN TWO TIME SERIES


           PIG-IRON PRODUCTION *        BRADSTREET’S MONTHLY INDEX
                                          OF COMMODITY PRICES †
YEAR     Unit:        Increase or      Unit: $1.00    Increase or      xy
         millions     Decrease (x)                    Decrease (y)
           (a)           (b)              (c)            (d)           (e)

1903      18.01                           7.94
1904      16.50          -1               7.92           -1            +1
1905      22.99          +1               8.10           +1            +1
1906      25.31          +1               8.42           +1            +1
1907      25.78          +1               8.90           +1            +1
1908      15.94          -1               8.01           -1            +1
1909      25.80          +1               8.52           +1            +1
1910      27.30          +1               8.99           +1            +1
1911      23.65          -1               8.71           -1            +1
1912      29.73          +1               9.19           +1            +1
1913      30.97          +1               9.21           +1            +1
1914      23.33          -1               8.90           -1            +1
1915      29.92          +1               9.85           +1            +1
1916      39.44          +1              11.82           +1            +1
                                                                       13

Measure of correspondence = 13/13 = +1


* Source: Edmund E. Day, “An Index of the Physical Volume of Production,” reprint
from Review of Economic Statistics, p. 18.
† Source: Review of Economic Statistics, prel. vol. I, p. 70.


The obvious objection to the crude measure of correspondence 
is that it does not take into account the magnitude of the changes 
of the two variables; a slight change either upward or downward 
counts equally with a very large change; a small disagreement 
may offset a large agreement. This objection may be met by
expressing the changes of the two series relatively.¹ But the
measure in this relative form requires much more time in
computation and is of doubtful value at best.¹ If a more careful
analysis is to be made of the relationship between two time series,
it is best to compute a coefficient of correlation.

1 See L. March, “Comparaison numérique des courbes statistiques,” Journal Société de
Statistique de Paris, 1905, pp. 255 et seq.; Armand Julin, Principes de Statistique Théorique et
Appliquée (Bruxelles, 1921), pp. 489 et seq.; J. D. Magee, “The Degree of Correspondence
between Two Series of Index Numbers,” Quarterly Publication of the American Statistical
Association, June, 1912, pp. 174-181.

The correlation of two or more variables always depends upon
the definite linking or pairing of the variables which are under
observation. In the correlation of variables in time the pairing
is in time; the items are of the same periods of time, or are of
periods of time which are related to one another in some uniform
fashion. Thus, we may compare price and production of pig
iron for identical months, or we may compare the price of pig
iron with the production of pig iron of the preceding month. As
long as the same time relationship between the two or more
variables is preserved throughout the analysis, the variables
may be said to be paired in time.

In the application of the correlation analysis to time series,
special care must be exercised. In particular, great caution is
necessary in the interpretation of the results. It must always be
remembered, for instance, that time series are usually composites
of a number of distinct varieties of movement, and that the size of
the coefficient may reflect connections between movements other
than those directly under investigation. Not infrequently the
clear correlation of one variety of movement in two time series
is so offset by the counter-influences of other movements that a
coefficient calculated from the series as originally stated will
show no correlation at all.

1 For a discussion of the merits and defects of the form suggested by Magee, see W. M.
Persons’ Review of Magee’s “Money and Prices: A Statistical Study of Price Movements,”
Quarterly Publication of the American Statistical Association, March, 1914; J. D. Magee,
“Note in Reply to Persons’ Review,” ibid., Dec., 1914; W. M. Persons, “Rejoinder to
Magee,” ibid., Dec., 1914.

This difficulty is sometimes overcome by what is known as the
“method of first differences.”² Suppose, for example, that an
analysis is undertaken to test for correlation the short-time
fluctuations of wholesale prices and merchandise imports. All
influences of trend are to be eliminated. The first step may be
the subtraction of each item of both series from the succeeding
item. These first differences will be largely free from the trend
influence. If the correlation analysis is applied, then, to the
first differences instead of to the original items, the resultant
coefficient will register the relationship between short-time, not
long-time, forces.¹

2 Suggested by R. H. Hooker in a paper “On the correlation of successive observations
illustrated by corn prices,” Statistical Journal, vol. 68, 1905, pp. 696 et seq.
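
The effect of taking first differences may be sketched as follows. The two series are hypothetical: they were constructed for the illustration so that their short-time movements agree exactly while their trends run in opposite directions. The function names are assumptions, and the correlation coefficient is computed by the ordinary product-moment formula.

```python
import math

def pearson(x, y):
    """Product-moment coefficient of correlation of two paired series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def first_differences(series):
    """Each item subtracted from the succeeding item."""
    return [later - earlier for earlier, later in zip(series, series[1:])]

# Hypothetical series: rising trend in x, falling trend in y,
# identical short-time swings in both.
x = [10, 13, 12, 16, 15, 19, 18, 22]
y = [30, 31, 28, 30, 27, 29, 26, 28]

r_original = pearson(x, y)
r_differences = pearson(first_differences(x), first_differences(y))

print(round(r_original, 2), round(r_differences, 2))
```

In this constructed case the coefficient from the original items is negative, because the opposed trends dominate, while the coefficient from the first differences is +1.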

But first differences themselves may be mixtures of different
varieties of movement.² To make the correlation analysis
perfectly explicit it is best to apply it to series expressing one, and
only one, type of fluctuation. When series from which trends and
seasonal variations have been removed are correlated, the
meaning of the coefficient is much more likely to be clear.

The computation of the correlation coefficient from series of
cyclical fluctuations from which trend and seasonal variation have
been removed, is relatively simple. If the cyclical fluctuations
of each variable have been expressed in units of the variable’s
standard deviation, this is particularly true.³ In this case the
individual items are already in “reduced” form; that is, in the
form of $x/\sigma_x$ and $y/\sigma_y$. The correlation coefficient is
given, therefore, by the mean cross-product of these items, paired
in the X and Y series; i.e. by

$$r = \frac{\Sigma\left(\dfrac{x}{\sigma_x}\cdot\dfrac{y}{\sigma_y}\right)}{N}$$

Table 104 affords an illustration of the method. 
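
The mean cross-product may be sketched in Python as follows. The function name is illustrative; the four pairs are the January to April, 1904, items of Table 104, and the totals 33.28 and 48 are those given with that table.

```python
def cross_products(zx, zy):
    """Products of paired items already stated in units of sigma."""
    return [round(a * b, 2) for a, b in zip(zx, zy)]

# First four pairs of Table 104 (outside bank clearings, pig-iron production).
clearings = [-0.6, -0.1, -0.5, -0.6]
pig_iron = [-2.0, -0.8, -0.6, -0.1]
first_products = cross_products(clearings, pig_iron)

# The coefficient for the whole table is the column total divided by N.
SUM_OF_CROSS_PRODUCTS = 33.28
N = 48
r = SUM_OF_CROSS_PRODUCTS / N

print(first_products)
print(round(r, 2))
```

No means or standard deviations need be computed at this stage, since both were disposed of when the items were put in “cycle” form.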

1 Of course, the first differences may be stated relatively as well as absolutely.

2 As a result, the “difference” method encounters serious difficulties under some
conditions. See W. M. Persons, “On the Variate Difference Method and Curve Fitting,”
Quarterly Publication of the American Statistical Association, June, 1917, pp. 602-642; also
G. U. Yule, “The Problem of Time Correlation with Especial Reference to the Variate
Difference Correlation Method,” Statistical Journal, July, 1921, pp. 497-537.

3 Items in this form are the “cycle” figures referred to in the preceding chapter. See
Table 101.


TABLE 104. CALCULATION OF CORRELATION COEFFICIENT FROM PAIRED DATA
IN “CYCLE” FORM

(Monthly Outside Bank Clearings and Pig-Iron Production, 1904-1907)
Deviations from trend corrected for seasonal variation, in units of σ *

                     OUTSIDE BANK     PIG-IRON
                      CLEARINGS      PRODUCTION     CROSS PRODUCT
MONTH                  (x/σx)          (y/σy)      (x/σx)(y/σy)

1904 January   .        - .6           - 2.0            1.20
     February  .        - .1           -  .8             .08
     March  .  .        - .5           -  .6             .30
     April  .  .        - .6           -  .1             .06
     May .  .  .        - .7           -  .3             .21
     June   .  .        - .5           -  .8             .40
     July   .  .        - .2           -  .4             .08
     August .  .        - .4           - 1.3             .52
     September .        - .2           -  .6             .52
     October   .        - .4           -  .7             .28
     November  .          .4           -  .3           - .12
     December  .          .4              .1             .04
1905 January   .        - .5              .6           - .30
     February  .        - .1              .3           - .03
     March  .  .          .4              .7             .28
     April  .  .          .1              .8             .08
     May .  .  .          .8              .8             .64
     June   .  .          .5              .6             .30
     July   .  .          .1              .5             .05
     August .  .          .6              .7             .42
     September .          .5              .8             .40
     October   .          .4              .9             .36
     November  .          .7             1.0             .70
     December  .          .6             1.1             .66
1906 January   .         1.4             1.2            1.68
     February  .          .8              .9             .72
     March  .  .          .7             1.0             .70
     April  .  .          .2              .9             .18
     May .  .  .          .7              .9             .63
     June   .  .          .6              .8             .48
     July   .  .          .4              .9             .36
     August .  .          .8              .6             .48
     September .          .1              .7             .07
     October   .         1.0             1.0            1.00
     November  .          .9             1.3            2.07
     December  .          .6             1.3             .78
1907 January   .         1.5             1.3            1.95
     February  .          .9             1.0             .90
     March  .  .         1.2              .9            1.08
     April  .  .         1.0             1.0            1.00
     May .  .  .         1.5             1.1            1.65
     June   .  .          .7             1.3             .91
     July   .  .         1.3             1.4            1.82
     August .  .         1.0             1.2            1.20
     September .          .4             1.0             .40
     October   .         1.3             1.1            1.43
     November  .       - 2.0              .1           - .20
     December  .       - 2.7           - 1.5            4.86
                                                       33.28

* For Outside Bank Clearings, σ = 8.62; for Pig-Iron Production, σ = 19.15.

$$r = \frac{\Sigma\left(\dfrac{x}{\sigma_x}\cdot\dfrac{y}{\sigma_y}\right)}{N} = \frac{33.28}{48} = .69$$

Source: Review of Economic Statistics, prel. vol. I, pp. 190, 192.


Suppose, however, that the correlation analysis is to be applied
to two time series not in “cycle” form. In this case resort may
be made to the general formulae for the correlation coefficient.
It is to be noted, however, that the data do not take the form
of a correlation table; each item occurs but once and the problem
of dealing with multiple frequencies does not arise. Under
these conditions, certain computation forms possess advantages
over others.

One of the most convenient methods of computation rests upon
the formula:

$$r = \frac{\Sigma (X - A)(Y - B) - N(M_x - A)(M_y - B)}{\sqrt{\Sigma (X - A)^2 - N(M_x - A)^2}\;\sqrt{\Sigma (Y - B)^2 - N(M_y - B)^2}}$$

where X and Y are, as before, the original items of two paired
series; $M_x$ and $M_y$, their arithmetic means; N, the number of
items in each; and A and B, two arbitrary origins, A in the X
series, and B in the Y series.¹

The values of the terms of this formula are readily obtained
from a table in the form of Table 105. The total of column (c) =
$\Sigma X - NA$. This divided by N gives $(M_x - A)$. Similarly the
sum of the items of column (d) divided by N gives $(M_y - B)$.
The total of column (e) is $\Sigma (X - A)(Y - B)$; of column (f) is
$\Sigma (X - A)^2$; of column (g) is $\Sigma (Y - B)^2$. In the case of the
data of Table 105,

$$\Sigma (X - A)(Y - B) = 6556$$

$$M_x - A = \frac{\Sigma X - NA}{N} = \frac{-269}{36} = -7.5$$

$$M_y - B = \frac{\Sigma Y - NB}{N} = \frac{-129}{36} = -3.5$$

$$\Sigma (X - A)^2 = 16509 \qquad \Sigma (Y - B)^2 = 3083$$

$$r = \frac{6556 - 36(-7.5)(-3.5)}{\sqrt{16509 - (36 \times 56.25)}\;\sqrt{3083 - (36 \times 12.25)}}
= \frac{6556 - 945}{\sqrt{16509 - 2025}\;\sqrt{3083 - 441}}
= \frac{5611}{\sqrt{14484}\;\sqrt{2642}} = .91$$
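
The computation may be checked by a short Python sketch. The function name is illustrative; the two lists are the index numbers of Table 105, and both origins are set at 100 as in that table. The sketch carries the means at full precision, whereas the text rounds them to one decimal place, so the two results agree only to about two places.

```python
import math

def r_from_arbitrary_origins(X, Y, A, B):
    """Correlation coefficient from deviations about arbitrary origins A and B."""
    n = len(X)
    dx = [x - A for x in X]
    dy = [y - B for y in Y]
    mean_dx = sum(dx) / n            # (Mx - A)
    mean_dy = sum(dy) / n            # (My - B)
    numerator = sum(a * b for a, b in zip(dx, dy)) - n * mean_dx * mean_dy
    under_x = sum(a * a for a in dx) - n * mean_dx ** 2
    under_y = sum(b * b for b in dy) - n * mean_dy ** 2
    return numerator / (math.sqrt(under_x) * math.sqrt(under_y))

# Monthly index numbers, 1919-1921, from Table 105.
iron_steel = [118, 112, 108, 105, 100, 100, 103, 105, 106, 70, 81, 98,
              110, 112, 113, 115, 106, 112, 111, 108, 111, 110, 107, 98,
              81, 81, 77, 72, 69, 66, 57, 58, 59, 62, 65, 65]
chemical = [104, 102, 99, 98, 97, 96, 99, 100, 102, 101, 103, 102,
            104, 104, 105, 104, 102, 103, 106, 106, 106, 106, 102, 96,
            92, 91, 90, 88, 87, 84, 83, 82, 82, 83, 82, 80]

r = r_from_arbitrary_origins(iron_steel, chemical, A=100, B=100)
print(round(r, 3))
```

The choice of origins affects only the labor of computation; any other pair of origins gives the same coefficient.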


1 This is one of the formulae explained in the earlier discussion of correlation. In other
symbols it is:

$$r = \frac{\Sigma x'y' - N c_x c_y}{\sqrt{\Sigma (x')^2 - N c_x^2}\;\sqrt{\Sigma (y')^2 - N c_y^2}}$$


TABLE 105. CALCULATION OF CORRELATION COEFFICIENT FROM TIME SERIES:
FIRST METHOD

(Index Numbers of Monthly Employment in Iron and Steel, and Chemical, Oil,
and Paint Industries)

                    IRON AND   CHEMICAL,   DEVIATION   DEVIATION
                     STEEL     OIL, AND      FROM        FROM
                     INDEX      PAINT       ORIGIN      ORIGIN     (X-A)(Y-B)  (X-A)²   (Y-B)²
MONTH                 (X)       INDEX      (A = 100)   (B = 100)
                                 (Y)        (X - A)     (Y - B)
                      (a)        (b)          (c)         (d)         (e)        (f)      (g)

1919 January   .      118        104           18           4           72        324       16
     February  .      112        102           12           2           24        144        4
     March  .  .      108         99            8         - 1          - 8         64        1
     April  .  .      105         98            5         - 2         - 10         25        4
     May .  .  .      100         97            0         - 3            0          0        9
     June   .  .      100         96            0         - 4            0          0       16
     July   .  .      103         99            3         - 1          - 3          9        1
     August .  .      105        100            5           0            0         25        0
     September .      106        102            6           2           12         36        4
     October   .       70        101         - 30           1         - 30        900        1
     November  .       81        103         - 19           3         - 57        361        9
     December  .       98        102          - 2           2          - 4          4        4
1920 January   .      110        104           10           4           40        100       16
     February  .      112        104           12           4           48        144       16
     March  .  .      113        105           13           5           65        169       25
     April  .  .      115        104           15           4           60        225       16
     May .  .  .      106        102            6           2           12         36        4
     June   .  .      112        103           12           3           36        144        9
     July   .  .      111        106           11           6           66        121       36
     August .  .      108        106            8           6           48         64       36
     September .      111        106           11           6           66        121       36
     October   .      110        106           10           6           60        100       36
     November  .      107        102            7           2           14         49        4
     December  .       98         96          - 2         - 4            8          4       16
1921 January   .       81         92         - 19         - 8          152        361       64
     February  .       81         91         - 19         - 9          171        361       81
     March  .  .       77         90         - 23        - 10          230        529      100
     April  .  .       72         88         - 28        - 12          336        784      144
     May .  .  .       69         87         - 31        - 13          403        961      169
     June   .  .       66         84         - 34        - 16          544       1156      256
     July   .  .       57         83         - 43        - 17          731       1849      289
     August .  .       58         82         - 42        - 18          756       1764      324
     September .       59         82         - 41        - 18          738       1681      324
     October   .       62         83         - 38        - 17          646       1444      289
     November  .       65         82         - 35        - 18          630       1225      324
     December  .       65         80         - 35        - 20          700       1225      400
                                             - 269       - 129        6,556     16,509    3,083

Source: Review of Economic Statistics, prel. vol. V, p. 298.


Another serviceable computation form is based upon the
formula for r suggested by Leonard P. Ayres:¹

$$r = \frac{\Sigma XY - M_x \Sigma Y}{\sqrt{(\Sigma X^2 - M_x \Sigma X)(\Sigma Y^2 - M_y \Sigma Y)}}$$


The terms required for determining 7 from this formula are 
readily obtained as indicated in Table 106. 


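$$\Sigma X = 3331 \qquad \Sigma Y = 3471$$

$$\Sigma X^2 = 322{,}709 \qquad \Sigma Y^2 = 337{,}283$$

$$M_x = \frac{\Sigma X}{N} = 92.5 \qquad M_y = \frac{\Sigma Y}{N} = 96.4$$

$$\Sigma XY = 326{,}756$$

$$r = \frac{326{,}756 - 92.5 \times 3471}{\sqrt{(322{,}709 - 92.5 \times 3331)(337{,}283 - 96.4 \times 3471)}}
= \frac{326{,}756 - 321{,}068}{\sqrt{(322{,}709 - 308{,}118)(337{,}283 - 334{,}604)}}$$

$$= \frac{5688}{\sqrt{14591 \times 2679}} = \frac{5688}{\sqrt{39{,}089{,}289}} = \frac{5688}{6252} = .91$$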


This computation form avoids the complication of deviations with
their unlike signs, as well as all corrections on account of differences
between assumed and true averages. It has the further merit of
facilitating the use of various mechanical devices; the squares
required by the formula are readily taken from tables of squares,
and the products and sums are easily obtained on computing
machines. The form frequently may be used to advantage.
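
A sketch of the Ayres form in Python follows. The function name is illustrative, and only the column totals of Table 106 are used. The means are carried at full precision here; the text rounds them to 92.5 and 96.4, which alters the intermediate figures slightly but leaves the coefficient at .91 to two places.

```python
import math

def ayres_r(n, sum_x, sum_y, sum_xx, sum_yy, sum_xy):
    """Correlation coefficient from sums of the original items alone."""
    mean_x = sum_x / n
    mean_y = sum_y / n
    numerator = sum_xy - mean_x * sum_y
    denominator = math.sqrt((sum_xx - mean_x * sum_x) * (sum_yy - mean_y * sum_y))
    return numerator / denominator

# Column totals of Table 106 (N = 36 monthly items).
r = ayres_r(n=36, sum_x=3331, sum_y=3471,
            sum_xx=322709, sum_yy=337283, sum_xy=326756)

print(round(r, 2))
```

Since only running totals of X, Y, their squares, and their products are required, the whole computation can be carried through in a single pass over the data.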


1 See “A Shorter Method for Computing Coefficients of Correlation and of Regression
and the Correlation Ratio,” Journal of Educational Research, March, 1920.


TABLE 106. CALCULATION OF CORRELATION COEFFICIENT FROM TIME SERIES:
SECOND (OR AYRES) METHOD

(Index Numbers of Monthly Employment in Iron and Steel, and Chemical, Oil,
and Paint Industries)

                                CHEMICAL,
                    IRON AND    OIL, AND
MONTH                STEEL       PAINT         X²          Y²          XY
                      (X)         (Y)
                      (a)         (b)          (c)         (d)         (e)

1919 January   .      118         104        13,924      10,816      12,272
     February  .      112         102        12,544      10,404      11,424
     March  .  .      108          99        11,664       9,801      10,692
     April  .  .      105          98        11,025       9,604      10,290
     May .  .  .      100          97        10,000       9,409       9,700
     June   .  .      100          96        10,000       9,216       9,600
     July   .  .      103          99        10,609       9,801      10,197
     August .  .      105         100        11,025      10,000      10,500
     September .      106         102        11,236      10,404      10,812
     October   .       70         101         4,900      10,201       7,070
     November  .       81         103         6,561      10,609       8,343
     December  .       98         102         9,604      10,404       9,996
1920 January   .      110         104        12,100      10,816      11,440
     February  .      112         104        12,544      10,816      11,648
     March  .  .      113         105        12,769      11,025      11,865
     April  .  .      115         104        13,225      10,816      11,960
     May .  .  .      106         102        11,236      10,404      10,812
     June   .  .      112         103        12,544      10,609      11,536
     July   .  .      111         106        12,321      11,236      11,766
     August .  .      108         106        11,664      11,236      11,448
     September .      111         106        12,321      11,236      11,766
     October   .      110         106        12,100      11,236      11,660
     November  .      107         102        11,449      10,404      10,914
     December  .       98          96         9,604       9,216       9,408
1921 January   .       81          92         6,561       8,464       7,452
     February  .       81          91         6,561       8,281       7,371
     March  .  .       77          90         5,929       8,100       6,930
     April  .  .       72          88         5,184       7,744       6,336
     May .  .  .       69          87         4,761       7,569       6,003
     June   .  .       66          84         4,356       7,056       5,544
     July   .  .       57          83         3,249       6,889       4,731
     August .  .       58          82         3,364       6,724       4,756
     September .       59          82         3,481       6,724       4,838
     October   .       62          83         3,844       6,889       5,146
     November  .       65          82         4,225       6,724       5,330
     December  .       65          80         4,225       6,400       5,200

                     3331        3471       322,709     337,283     326,756


B. TIME RELATIONSHIP: LEAD AND LAG


The pairing of items which underlies every analysis of
correlation in time series need not be of simultaneous items. Instead
the items of one series may be linked with items of the other at
any fixed number of intervals earlier or later. Thus, the January
item of one series may be paired with the April item of the other;
the February item of the first, with the May item of the second;
and so on. The degree of correlation between the series may
then be measured directly, subject to this arrangement.

By varying the timing of the two series, it is possible to deter- 
mine the time relationship which results in the maximum cor- 
respondence, or correlation, of the variables. The difference 
between this arrangement and simultaneity measures the lag of 
the series that runs behind, or the lead of the series that runs
ahead. The determination of lag, or lead, in time series is one 
of the most important phases of cycle analysis. 

Lag (or lead) may be determined in the first place by graphic
means. Suppose two deviation series (expressed in deviations
from trend corrected for seasonal variation) are reduced to units
of their respective standard deviations; i.e. are stated in “cycle”
form. By this step, differences in the amplitude of the cyclical
fluctuation are removed from the picture. It is possible,
therefore, to focus attention on the timing of the cyclical movements.
Let the cycle series be plotted on semi-transparent paper, with
identical horizontal and vertical axes and scales. Next,
superimpose the graphs on one another over some strong light.¹ If the
horizontal axes of the figures are kept coincident and the two
plots moved to right and left along this line, the position may be
found in which the two curves most nearly coincide. Then by
noting the position of the plots on the one curve relative to those
immediately underneath on the other curve, a definite idea is
obtained of the timing of the variables.

1 A light placed under ground glass may be conveniently used.

It is not to be thought that it is always clear just what position
of the plots will result in the maximum correspondence of the
curves. Sometimes, it is true, the position is unmistakably
defined. At other times, however, the time relationship of the
variables is obscured, or appears to differ from one part of the
curves to another. Discrimination and good judgment in the
examination of the data are of the utmost importance if
conclusions concerning lag and lead are to rest upon graphic
comparison.¹

Lag (or lead) in time series is to be measured in another way;
namely, through the use of the correlation coefficient.² If
correlation coefficients are worked out for two time series with 
the items paired differently in time—say first with the items 
simultaneously paired, and then with one series leading one, two, 
and three months and then lagging one, two, and three months — 
the different values of the correlation coefficient thus obtained 
are likely to form a series which will indicate definitely the time 
relationship of the two variables giving the maximum correla- 
tion. An illustration of the results of this method appears in 
Table 107 and Chart 83. 
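
The procedure may be sketched in Python as follows. The series are hypothetical (the second is simply the first delayed two intervals, with two arbitrary opening items), and the function names and the sign convention, a positive lag meaning that the second series runs behind the first, are assumptions of the sketch.

```python
import math

def pearson(x, y):
    """Product-moment coefficient of correlation of two paired series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def lagged_r(x, y, lag):
    """Correlate x[t] with y[t + lag]; unpaired end items are dropped."""
    n = len(x)
    if lag >= 0:
        xs, ys = x[:n - lag], y[lag:]
    else:
        xs, ys = x[-lag:], y[:n + lag]
    return pearson(xs, ys)

# Hypothetical "cycle" series; y repeats x two intervals later.
x = [0, 1, 3, 2, 0, -2, -3, -1, 1, 2, 0, -1, -2, 0]
y = [1, -1] + x[:-2]

coefficients = {lag: lagged_r(x, y, lag) for lag in range(-3, 4)}
best_lag = max(coefficients, key=coefficients.get)

for lag, r in coefficients.items():
    print(lag, round(r, 2))
print(best_lag)
```

Each shift shortens the paired series by the number of intervals shifted, so coefficients for long lags rest upon fewer items than the coefficient for simultaneous pairing.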


TABLE 107. COEFFICIENTS OF CORRELATION BETWEEN CYCLES OF PIG-IRON
PRODUCTION AND INTEREST RATE ON SIXTY-TO-NINETY DAY COMMERCIAL PAPER,
WITH 0, 3, 4, 5, 6, 7, 8, 9, AND 12 MONTHS’ LAG OF INTEREST RATE ³


  r0,0    r0,3    r0,4    r0,5    r0,6    r0,7    r0,8    r0,9    r0,12

  [coefficients illegible in source]


In this case it is evident that the maximum correlation
coefficient is obtained with a lag of five or six months in the interest
rate. The method employed here is somewhat laborious but
perfectly simple in character. It gives a quantitative
measurement of average lag, which is highly valuable in the study of the
time relationships of temporal variables. It should be
remembered, however, that the figure finally obtained is an average
result; it may not accurately represent the time relationship
existing between the paired series in certain sections of the period
covered by the analysis.⁴ Furthermore, the best available
measurement of time relationships of two series may be essentially
ambiguous in that it leaves unsettled the fundamental question
as to which series in the given relationship is leading, which lagging.
Difficulties of this sort appear in all phases of statistical analysis,
and give rise to most serious problems of interpretation even when
the treatment of the data has been fully in accord with the best
statistical technique.

1 Somewhat more dependable results may be obtained by having a number of observers
work independently and subsequently pool their conclusions.

2 For an early use of this method see R. H. Hooker, “Correlation of the Marriage Rate
with Trade,” Statistical Journal, Sept. 1901, pp. 485-492.

3 From Review of Economic Statistics, prel. vol. I, p. 124.

4 For disclosing differences in lead or lag during shorter periods of time, the correlation
analysis is frequently inferior to graphic comparison of the “cycle” series.

CHART 83. COEFFICIENTS OF CORRELATION BETWEEN CYCLES OF PIG-IRON
PRODUCTION AND INTEREST RATE ON SIXTY-TO-NINETY DAY COMMERCIAL PAPER
WITH 0, 3, 4, 5, 6, 7, 8, 9, AND 12 MONTHS’ LAG OF INTEREST RATE

[Chart not reproduced. Vertical axis: Coefficient of Correlation.
Horizontal axis: Lag in Months.]

Source: Review of Economic Statistics, prel. vol. I, p. 124.


Conclusions of great importance are sometimes to be drawn 
from a knowledge of lead and lag in time series. When there is a 
high degree of correspondence between the fluctuations of two 
series and the movements of one definitely precede those of the 
other, the first may be tentatively regarded as the cause of the 
second if this conclusion is not inconsistent with other lines of 
evidence. Whether or not a causal connection is to be postulated,
the persistent tendency for the movements of one series to
anticipate those of another offers opportunity for the prediction of
the movement of the second in terms of the first. It is this fact 
which is at the foundation of the technique of business forecasting 
as recently developed along statistical lines. 


ANALYSIS OF GROUPED VARIABLES:
INDEX NUMBERS


CHAPTER XXI 


THE NATURE AND PURPOSE OF INDEX NUMBERS 


Amonc the most important instruments of statistical analysis 
is the index number.!' An index number is a number designed to 
express the relative change or difference of a group of related 
variables. Suppose the prices of 150 articles in 1925 are to be 
compared with the prices of the same articles in 1910. Some 
device is needed for measuring the change of the prices of the 
150 articles as a whole. An index number serves precisely this 
purpose. It is designed to register what is essentially a com- 
posite difference. Mathematically, an index number may rest 
upon either an aggregate or an average. In either case it serves 
to express differences in an entire group of variables, instead of 
differences in any one of the individual items composing the group.? 

The relative difference registered in an index number may 
relate to a difference either in time or in space. Thus, an index 
number may be set up to measure change in the cost of living in 
the United States from January 1, 1915, to January 1, 1925; 

1The most valuable comprehensive treatments of the subject of index numbers are to be 
found in Irving Fisher, The Making of Index Numbers; Wesley C. Mitchell, ‘‘ Index Num- 
bers of Wholesale Prices in the United States and Foreign Countries,” Bulletin of the United 
States Bureau of Labor Statistics, No. 284, Oct. 1921; and C. M. Walsh, The Problem of 
Estimation. Reference may also be made to two articles by A. A. Young: ‘“‘ The Measure- 
ment of Changes of the General Price Level,’ Quarterly Journal of Economics, vol. XXXV 
(Aug. 1921), pp. 557-573; and “Fisher’s The Making of Index Numbers,” ibid., vol. 
XXXVII (Feb. 1923), pp. 342-364. 

2 Index numbers are not to be confused with mere relatives. If the individual items of a
simple time series are to be related to some particular point, or base, and the items are consequently
converted into percentage relatives of the base item, the percentage figures thus
obtained are preferably referred to as relatives (price relatives if the original items are of
prices, production relatives if the original items are of production, and so on). Relatives of
this sort are sometimes called index numbers, but it is better to give the latter term the
distinctive meaning indicated above.
but it is equally possible to construct another index to measure
difference in the cost of living on January 1, 1925, in New York 
City and Washington, D. C. Index numbers designed to record 
movements in time are the more common type; but relative 
numbers, expressing spatial differences, may be just as truly 
index numbers. The one essential feature of the index number is
that it expresses the relative difference of a group of variables. 

Thus conceived, the index number has a distinct and important 
function in statistical analysis. In a large number of investiga- 
tions, the variables to be analyzed are definitely grouped. Study 
of such associated variables commonly shows that there is a sig- 
nificance in the movement of the group which is distinct from the 
significance of the movements of its constituents, and that only 
through the consideration of the group as a whole is the character 
and meaning of the group movement to be ascertained. It is in 
this connection that the use of index numbers is indicated. To 
quote Bowley (Elements of Statistics, p. 196), the purpose of index 
numbers is “to measure the change in some quantity which we
cannot observe directly, which we know to have a definite influ- 
ence on many other quantities which we can so observe, tending 
to increase all, or diminish all, while this influence is concealed 
by the action of many causes affecting the separate quantities 
in various ways.” Through the use of index numbers, we gain 
knowledge of the action of the more general forces which appear 
to influence groups of individual variables. 

The general nature of the index number will become clearer if a 
concrete illustration is considered. Examine the data of Table 108.
The original figures appearing in columns (a) to (e) of this table 
give the price records of a group of five cereals for the period 
1910-1920. The prices of all five of the cereals changed radically
during this period, as may be seen clearly both in the original
averages and in the relatives appearing in columns (f) to (j).
The question is: what may be said of the price changes of the
group of five as a whole? The index shown in column (k) affords
a definite quantitative answer to this question. Thus, it shows 
that for the group as a whole, prices were 17.6 per cent higher in 
1914 than in 1910; 68.4 per cent higher in 1916; 118.5 per cent
higher in 1918; and only 48.5 per cent higher in 1920. The price of no
single cereal followed exactly this course. The series of index 
numbers serves to register the group movement as distinct from 
the movements of the individuals composing the group. 


TABLE 108. AVERAGE FARM PRICES AND INDEX NUMBERS OF AVERAGE
FARM PRICES FOR FIVE CEREALS IN THE UNITED STATES, 1910-1920

      | AVERAGE FARM PRICE ON DEC. 1           | AVERAGE FARM PRICE RELATIVE TO 1910    | INDEX NUMBER
      | (in cents per bushel)                  | (1910 price = 100)                     | OF PRICES*
YEAR  | CORN  | OATS  | WHEAT | BARLEY | RYE   | CORN  | OATS  | WHEAT | BARLEY | RYE   | (1910 = 100)
      | (a)   | (b)   | (c)   | (d)    | (e)   | (f)   | (g)   | (h)   | (i)    | (j)   | (k)
1910  |  48.0 |  34.4 |  88.3 |   57.8 |  71.5 | 100.0 | 100.0 | 100.0 |  100.0 | 100.0 | 100.0
1911  |  61.8 |  45.0 |  87.4 |   86.9 |  83.2 | 128.7 | 130.8 |  99.0 |  150.3 | 116.4 | 125.1
1912  |  48.7 |  31.9 |  76.0 |   50.5 |  66.3 | 101.5 |  92.7 |  86.1 |   87.3 |  92.7 |  92.1
1913  |  69.1 |  39.2 |  79.9 |   53.7 |  63.4 | 143.9 | 113.9 |  90.5 |   92.9 |  88.7 | 106.0
1914  |  64.4 |  43.8 |  98.6 |   54.3 |  86.5 | 134.1 | 127.3 | 111.7 |   93.9 | 121.0 | 117.6
1915  |  57.5 |  36.1 |  91.0 |   51.6 |  83.4 | 119.8 | 104.9 | 104.1 |   89.3 | 102.7 | 104.2
1916  |  88.9 |  52.4 | 160.3 |   88.1 | 122.1 | 185.2 | 152.3 | 181.5 |  152.4 | 170.8 | 168.4
1917  | 127.9 |  66.6 | 200.8 |  113.7 | 166.0 | 266.5 | 193.6 | 226.3 |  196.7 | 232.2 | 223.2
1918  | 136.5 |  70.9 | 204.2 |   91.7 | 151.6 | 284.4 | 206.1 | 231.3 |  158.6 | 212.0 | 218.5
1919  | 134.7 |  71.5 | 215.1 |  121.0 | 134.5 | 280.6 | 207.8 | 243.6 |  209.3 | 188.1 | 225.9
1920  |  67.7 |  47.2 | 144.3 |   70.7 | 127.8 | 141.0 | 137.2 | 163.4 |  122.3 | 178.7 | 148.5


* Obtained as a ratio of aggregates calculated by multiplying the prices of the given year and of 
1910 by the quantities of 1910. 
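
The relatives of columns (f) to (j), and the ratio of aggregates described in the note to Table 108, may be sketched in Python as follows. The prices are those of the table for 1910 and 1920. The quantities of 1910 are not given in the text, so the figures in `assumed_quantities_1910` are purely hypothetical placeholders, and the resulting index should not be expected to reproduce column (k).

```python
# Dec. 1 farm prices in cents per bushel, copied from Table 108.
prices_1910 = {"corn": 48.0, "oats": 34.4, "wheat": 88.3, "barley": 57.8, "rye": 71.5}
prices_1920 = {"corn": 67.7, "oats": 47.2, "wheat": 144.3, "barley": 70.7, "rye": 127.8}

# HYPOTHETICAL 1910 quantities (millions of bushels); these are NOT from the text.
assumed_quantities_1910 = {
    "corn": 2900.0, "oats": 1200.0, "wheat": 640.0, "barley": 170.0, "rye": 35.0,
}


def price_relatives(base, given):
    """State each given-year price as a percentage of its base-year price."""
    return {name: 100.0 * given[name] / base[name] for name in base}


def aggregate_index(base, given, quantities):
    """Ratio of aggregates: both years' prices are valued at the same quantities."""
    given_total = sum(given[name] * quantities[name] for name in base)
    base_total = sum(base[name] * quantities[name] for name in base)
    return 100.0 * given_total / base_total


# Relatives for 1920, rounded to one decimal as in the table.
relatives_1920 = {
    name: round(value, 1)
    for name, value in price_relatives(prices_1910, prices_1920).items()
}

# Group index for 1920 under the assumed quantities.
index_1920 = aggregate_index(prices_1910, prices_1920, assumed_quantities_1910)

print(relatives_1920)
print(round(index_1920, 1))
```

Because the ratio of aggregates is in effect an average of the relatives weighted by base-year values, the index necessarily falls between the smallest and the largest of the individual relatives, whatever quantities are assumed.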


Index numbers have had their most general application in the 
analysis of just such movements; that is, movements of prices. 
The individual commodity price is affected by a multiplicity of 
causes: seasonal and climatic influences; improvements in the 
technique of production; reductions of cost; changes of fashion;
legislation — such as new tariff enactments; war, with its 
radical changes of production and consumption; the develop- 
ment of substitutes; and so on. Playing through all of these, 
however, there is at times the general influence of monetary 
inflation or deflation. An index number permits a measure- 
ment of the extent to which this pervasive monetary influence is 
affecting the course of prices. Once obtained, the index number 
may be employed for such purposes as readjusting rates of wages, 
comparing real incomes at different times and places, fixing the
value of deferred payments, or regulating the supply of money 
and credit. Measurement of price movements is doubtless the 
most important single service of the index number. 

But the use of index numbers is by no means confined to the 
measurement of price changes. A few concrete examples will 
serve to show how extensive are the fields in economics and 
business in which index numbers are now effectively employed. 

TABLE 109. MONTHLY INDICES OF THE AVERAGE COST OF LIVING IN
THE UNITED STATES

(Base month of 1914 = 100)

YEAR | MONTH        | ALL ITEMS | FOOD | SHELTER | CLOTHING | FUEL AND LIGHT | SUNDRIES
1914 | [illegible]  |    100    | 100  |   100   |   100    |      100       |   100
1915 | [illegible]  |    101    | 100  |   100   |   103    |      102       |   100
1916 | [illegible]  |    109    | 111  |   102   |   120    |      104       |   104
1917 | [illegible]  |    131    | 146  |   105   |   143    |      126       | [illegible]
1918 | December     |    159    | 173  |   118   |   185    |      138       |   152
1919 | av. 3 mo.    |    172    | 186  |   129   |   205    |      144       |   164
1920 | mo. av.      |    198    | 205  |   154   |   261    |      168       |   185
1921 | mo. av.      |    167    | 156  |   169   |   166    |      183       |   184
1922 | mo. av.      |    157    | 142  |   166   |   155    |      179       | [illegible]
1923 | mo. av.      |    161    | 146  |   173   |   170    |      180       |   173
1923 | January      |    158    | 144  |   167   |   160    |      187       |   171
     | February     |    158    | 142  |   167   |   162    |      187       |   171
     | March        |    159    | 142  |   170   |   168    |      186       |   173
     | April        |    159    | 143  |   170   |   167    |      180       |   173
     | May          |    160    | 143  |   172   |   174    |      178       |   173
     | June         |    160    | 144  |   172   |   169    |      178       |   173
     | July         |    162    | 147  |   175   |   170    |      176       | [illegible]
     | August       |    162    | 146  |   175   |   171    |      176       |   173
     | September    |    163    | 149  |   175   |   175    |      176       |   173
     | October      |    164    | 150  |   175   |   176    |      178       |   172
     | November     |    165    | 151  |   180   |   174    |      176       |   174
     | December     |    165    | 150  |   180   |   175    |      176       |   174


Source: Bulletins of the National Industrial Conference Board. 


A type of index closely related to the index of general prices is 
one set up to measure the average cost of living in the United 
States. One of the best examples of this type is the cost-of-living 
index calculated for the United States by the National Industrial 
Conference Board. The index is given both for a number of 
groups of articles and for all items combined. It is published
for each month as a percentage of the base month in 1914. The 
index is shown in condensed form in Table 109. 

An index drawn from an entirely different field is shown in 
Table 110. This index, devised by Professor W. A. Berridge,
registers fluctuations from month to month in the numbers of 
workers on pay-rolls of manufacturing establishments in the 
United States. 


TABLE 110. MONTHLY INDEX OF EMPLOYMENT IN THIRTEEN GROUPS OF
INDUSTRIES IN THE UNITED STATES, 1919-1923

(Average for 1919 = 100)

MONTH      | 1919        | 1920 | 1921 | 1922 | 1923
January    | 101         | 106  |  74  |  84  |  94
February   |  95         | 106  |  80  |  84  |  95
March      |  96         | 107  |  80  |  80  |  96
April      |  96         | 108  |  80  |  78  |  95
May        |  98         | 108  |  81  |  80  |  -
June       | [illegible] | 108  |  82  |  82  |  -
July       | [illegible] | 100  |  82  |  81  |  -
August     | [illegible] | 100  |  81  |  81  |  -
September  | 103         |  97  |  83  |  84  |  -
October    |  99         |  94  |  85  |  88  |  -
November   | 102         |  89  |  86  |  90  |  -
December   | 105         |  83  |  86  |  92  |  -


Source: Review of Economic Statistics, prel. vol. IV, p. 50; V, p. 170. 


Still another interesting example is shown in the material of 
Table 111. In this case the monthly index registers the physical 
volume of production of selected basic materials. The index is 
compiled regularly by the Federal Reserve Board. It differs 
from the indices previously shown in that the data have been 
corrected for seasonal variation. 

Quite a different type of index is represented in the data shown 
in Table 112. This is the monthly index of the volume of trade
devised by the New York Federal Reserve Bank. Both trend 
and seasonal variation have been eliminated from the records 
so that the fluctuations of the index are of a cyclical order. 
TABLE 111. MONTHLY INDEX OF PRODUCTION IN BASIC INDUSTRIES
IN THE UNITED STATES, 1914-1924

(Monthly average, 1919 = 100)

[Table body illegible: monthly index values, January through December, for each year from 1914 to 1924.]


Source: Federal Reserve Bulletins. 


TABLE 112. MONTHLY INDEX OF THE VOLUME OF TRADE IN THE
UNITED STATES, 1919-1923

(Normal = 100)

MONTH      | 1919 | 1920        | 1921        | 1922 | 1923
January    |  97  | 109         |  90         |  93  | 109
February   |  97  | 104         |  92         |  96  | 110
March      |  95  | 108         |  91         | 102  | 113
April      | 101  | 105         |  92         |  99  | 109
May        | 106  | 104         | [illegible] | 103  | 110
June       | 108  | 102         |  92         | 103  | 106
July       | 110  | [illegible] |  91         |  97  |  -
August     | 107  |  99         |  95         |  99  |  -
September  | 107  |  96         |  94         | 103  |  -
October    | 109  |  95         |  92         | 103  |  -
November   | 105  |  94         |  92         | 104  |  -
December   | 105  |  93         |  93         | 105  |  -


Source: Carl Snyder, “A New Index of the Volume of Trade,” Journal of the American
Statistical Association, Dec. 1923, table opposite p. 962.
As these illustrations suggest, index numbers may be developed
whenever group movements are to be measured. Wages, imports 
and exports, employment, building, manufacturing, trading — 
these and a host of other elements of economic life offer oppor- 
tunities for index-number construction. In general, index num- 
bers are an important device whenever the individuals of a 
larger mass are affected by general influences which cannot be 
directly observed, but must be detected in the changes which 
occur in the mass as a whole. 

The problems involved in the construction of index numbers 
merit most careful consideration. A number of separable 
processes are to be distinguished. These may be briefly desig- 
nated as follows: 


1. Definition of the purpose of the index.

2. Selection of the data to be employed in constructing the
index.

3. Determination of the relative importance of the several
constituent variables represented in the data, and of the
manner in which the relative importance of these variables
is to be taken into account in the construction of the index.

4. Determination of the point of reference, or base, to which
differences in the group of variables are to be referred.

5. Selection of the type of aggregate or average through which
the movement of the group is to be expressed.


Though these processes are in some measure interwoven, they 
may be profitably examined separately. Some require much 
more extended study than others. The first two, and certain 
parts of the third, are covered in the present chapter. Con- 
sideration of other phases of index-number construction — phases 
of a more technical nature — will constitute the subject matter 
of the two chapters that follow. 

The first point to be settled in the construction of an index 
number is the purpose to be served. There must be careful 
definition of the factor to be measured by the index number. 
There is considerable controversy regarding the extent to which 
the purpose affects certain phases of index-number construction 
(such as the selection of the type of average to be employed),
but there is no disagreement at all as to the bearing of the purpose 
of the index upon the selection of data, and the weighting of the 
constituents in the actual computation of the index. Even if 
it be supposed that the purpose of the index affects the analysis 
only in these particulars, it is of the utmost importance that the 
purpose be carefully defined at the outset. 

The importance of first determining the purpose of an index 
number may be illustrated by considering the problem of con- 
structing a cost-of-living index. Suppose that a measurement of 
food costs is to be undertaken for the years 1913 and 1920, and
that the available data cover completely both quantities con- 
sumed and prices paid for all the important food commodities. 
On the basis of these records, at least two distinct index numbers 
may be set up. In the first place, an index number may be constructed
to measure the change in the cost of a fixed schedule of
consumption: an index number, in other words, based on prices
as given for 1913 and 1920, but with the quantities consumed 
fixed in some way, either in terms of some hypothetical or stand- 
ard set of quantities, or in terms of the quantities of 1913, or of 
1920, or of some combination of the two. In such an index, the 
only variables are the variable prices. The index consequently 
registers solely the difference occasioned by price changes within 
the limits of a standard, or at least a fixed, budget. It represents 
the variable cost of a constant bill of goods. In contrast to such 
an index, one may be developed to register the actual change in 
expenditure caused by changes in both prices and quantities. 
An index of this sort is to be described as an index of actual living 
costs, rather than as an index of the cost of a fixed or standardized 
budget. Each of these two types of index number has its place. 
Which one is to be employed depends altogether upon the object 
of inquiry. The purpose to be served by the index obviously 
has an important bearing upon the methods to be employed 
in its construction. 
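
The contrast between the two indexes may be made concrete with a minimal sketch. All prices and quantities below are hypothetical figures chosen only for illustration, and the fixed schedule is taken, as one of the options named above, to be the quantities of the earlier year.

```python
# HYPOTHETICAL data for two foods; none of these figures come from the text.
prices_base = {"bread": 6.0, "milk": 9.0}      # cents per unit, earlier year
prices_given = {"bread": 12.0, "milk": 15.0}   # cents per unit, later year
qty_base = {"bread": 100.0, "milk": 50.0}      # units consumed, earlier year
qty_given = {"bread": 90.0, "milk": 60.0}      # units consumed, later year


def total_cost(prices, quantities):
    """Cost of buying the stated quantities at the stated prices."""
    return sum(prices[item] * quantities[item] for item in prices)


base_cost = total_cost(prices_base, qty_base)

# Index of the cost of a fixed budget: only the prices vary.
fixed_budget_index = 100.0 * total_cost(prices_given, qty_base) / base_cost

# Index of actual living costs: both prices and quantities vary.
actual_cost_index = 100.0 * total_cost(prices_given, qty_given) / base_cost

print(round(fixed_budget_index, 1))  # 185.7
print(round(actual_cost_index, 1))   # 188.6
```

The two figures differ only because the quantities consumed changed between the two years; which figure is wanted depends upon the purpose of the inquiry.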

In some measure the nature of the problem involved in deter- 
mining the purpose of an index number cannot be fully explained 
until the general technique of index-number construction has been
covered. It is enough to urge at this point that the problems of 
method are conditioned in considerable measure by the purpose to 
be served by the index. Consequently, it is desirable at the outset
of the analysis to define this purpose as carefully as possible.1

The selection of the best possible data is one of the most 
important steps in the construction of an index number. In the 
first place, it is highly important that the data should be logically 
consistent throughout. Thus, in the case of price series, great 
care must be exercised to see that the prices quoted for different 
places and different times are prices of the same kind and quality 
of goods. Uniformity in this particular is not easily attained, 
but it is absolutely essential to the subsequent analysis. As a 
rule, series which are seriously defective in respect to the defini- 
tion of units or as to homogeneity are to be promptly discarded. 

The range of the available data is another factor having an 
important bearing upon the composition of the index number. 
Not infrequently, relatively little choice is possible in the selec- 
tion of data owing to the lack of satisfactory material. If the 
available records are scant, it may be that practically all that 
are in reasonably satisfactory shape will have to be utilized. 
In general, nothing more seriously hampers index-number con- 
struction than the paucity of dependable records. 

A third consideration in the selection of material is the repre- 
sentativeness of the data. It is at this point that the purpose 
of the index number must be most carefully considered. What 
is representative for an index number of one purpose is not at 
all representative for an index number of a different purpose. 
Adequate representativeness is assured, of course, if the data are 
essentially complete. For example, in working out an index of 


1 At times index numbers are constructed with no specific purpose. They are designed 
to measure movements which may be important in a number of different connections. Index 
numbers of this sort are sometimes called general-purpose indexes. Since they lack specific 
purpose, it is more difficult to make them harmonize with any single purpose; in fact, general-
purpose indexes may be so much a compromise of diverse purposes as to be of doubtful
validity. However, the very fact that general-purpose indexes are intended to serve a 
number of purposes has a bearing upon the methods to be employed in their construction. 
In other words, the purpose of the index, though not as specific as in other cases, still con- 
ditions the methods adopted. 
agricultural production in the United States, it may be possible
to secure records on practically all the more important agricul- 
tural products. If from eighty-five to ninety per cent of all 
constituent series are covered in authentic statistical records, an 
index number may be constructed which will be dependable in 
the sense that it covers the great bulk of the group of variables 
the movement of which the index is designed to register. But 
if an index has to be based upon only thirty-five or forty per cent 
of the total group, the significance of fluctuations in the index 
becomes much more uncertain. The remaining sixty-five per 
cent might tell a different story. An index number resting upon 
only thirty or forty per cent of the group must therefore be used 
with much more caution and with definite warnings not to take 
too much on faith. The more nearly the index can be made to 
cover one hundred per cent of the items of the group the more 
nearly representative it is from this particular point of view. 

Commonly, full representativeness in this sense is out of the 
question. Thus, it is inconceivable that a complete record of all 
price changes be included in an index number. Not only are 
trustworthy price records not available, but the computations 
involved in a complete index number would be so laborious as to 
be prohibitive. Furthermore, careful selection seems to satisfy 
all the requirements of accurate measurement. Here, as else- 
where in statistical analysis, the principle of sampling, if properly 
applied, is invaluable. It should be noted, however, that reliable
sampling, as ordinarily conceived, depends upon randomness, 
and that randomness is hardly a feasible condition in the 
assembling of the data used in many index numbers. When 
sampling is resorted to, the collection of data should be made 
with the utmost care, that the conditions of perfect sampling 
may be approximated as closely as possible. Commonly, how- 
ever, specific selection, rather than random sampling, must 
govern the assembling of data for the index number. 

In considering the question of representativeness, one is confronted
almost invariably by the fact that the total group to be
covered by the index number is made up of smaller groups, each 
one with its characteristic movement. Under this condition the
question of representativeness needs to be considered with refer- 
ence to each one of the constituent groups. Some of the groups 
may be covered by the available data much more satisfactorily 
than others. Under these circumstances, the index number may 
be constructed by developing a number of group indexes which 
are later combined to obtain the index number for the several 
groups as a whole. This process raises questions of weighting 
which have to be carefully considered. In general, the subject 
of selecting data with reference to the representative character 
of the constituent series is one of the most important questions 
in the formation of an index number of any kind. 
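
The combination of group indexes into a single index may be sketched as a weighted average. The group figures below are those of the December 1923 line of Table 109; the heading of the fifth group is illegible in the table and is assumed here to be fuel and light. The weights in `assumed_weights` are not given in the text and are introduced only for illustration.

```python
# Group indexes for December 1923, read from Table 109 (base month of 1914 = 100).
group_indexes = {
    "food": 150.0,
    "shelter": 180.0,
    "clothing": 175.0,
    "fuel_and_light": 176.0,  # column heading assumed; it is illegible in the table
    "sundries": 174.0,
}

# ASSUMED relative importance of each group (per cent of the budget); not from the text.
assumed_weights = {
    "food": 43.1,
    "shelter": 17.7,
    "clothing": 13.2,
    "fuel_and_light": 5.6,
    "sundries": 20.4,
}


def combine_group_indexes(indexes, weights):
    """Weighted arithmetic average of the group indexes."""
    total_weight = sum(weights[group] for group in indexes)
    weighted_sum = sum(indexes[group] * weights[group] for group in indexes)
    return weighted_sum / total_weight


combined = combine_group_indexes(group_indexes, assumed_weights)
unweighted = sum(group_indexes.values()) / len(group_indexes)

print(round(combined))    # 165
print(round(unweighted))  # 171
```

Under these assumed weights the combined figure rounds to 165, while a simple average of the five groups gives 171. The agreement with the all-items figure of the table should not be read as confirming the weights; the comparison serves only to show how much the choice of weights may matter.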

The problem of weighting is another of the problems which are
fundamental in index-number construction. Since the available 
data are complete only infrequently, it is necessary to determine 
carefully the relative importance of the several constituent series 
actually to be incorporated in the index. This commonly in- 
volves the calculation of general factors which are to be applied 
to the constituent series in computing the general index number. 
These factors should be proportionate, of course, to the relative 
importance of the several groups which the constituent series 
are designed to represent. To refer to a contrast frequently 
cited, it would be absurd to allow price changes of pepper and 
sugar to count equally in the construction of an index number 
designed to show changes in the cost of living. In the computa- 
tion of an index number for this purpose, factors need to be 
applied which will give price changes of sugar a much larger 
influence than price changes of pepper. Weighting here is 
indispensable. 
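
The contrast between pepper and sugar may be put in figures. The price relatives and the budget shares below are hypothetical, chosen only to show how far an unweighted average may depart from a weighted one.

```python
# HYPOTHETICAL price relatives (base year = 100) and budget shares; not from the text.
price_relatives = {"sugar": 110.0, "pepper": 200.0}
budget_shares = {"sugar": 95.0, "pepper": 5.0}  # relative importance in expenditure

# Unweighted: the doubling of the pepper price counts as much as the rise in sugar.
unweighted_index = sum(price_relatives.values()) / len(price_relatives)

# Weighted: each relative counts in proportion to its share of expenditure.
weighted_index = (
    sum(price_relatives[item] * budget_shares[item] for item in price_relatives)
    / sum(budget_shares.values())
)

print(unweighted_index)  # 155.0
print(weighted_index)    # 114.5
```

On these figures the unweighted index reports a rise of 55 per cent in the cost of the two articles, whereas the weighted index reports a rise of 14.5 per cent.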

Concerning the importance of weighting in index-number con- 
struction, there has been a great deal of discussion, and no little 
confusion. Some of this confusion has arisen in connection 
with a much-quoted statement by Bowley: “Given certain
conditions, the same result is obtained with sufficient closeness
whatever logical system of weights is applied.”1 Curiously

1 Elements of Statistics, p. 87. 
enough, this statement has been taken by some to mean that
weighting makes no difference in index numbers. As a matter 
of fact, it may make a great deal of difference. Bowley’s argu- 
ment is not against weighting: he states definitely that weights 
cannot in general be neglected without first trying various 
weighting systems to see that the resulting average is stable. 
True, if a large number of series are included in the index number, 
and there is no necessary connection between the changes of the 
series and their relative importance, weights may be unnecessary. 
Even under this condition, however, they may be wisely employed;
since they give an index number the appearance of greater 
reasonableness, a weighted index number undoubtedly com- 
mands more respect. Under most conditions, and in particular 
when the number of constituent series is relatively small, the use 
of weights cannot be regarded as in any sense optional. 
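
The practice of trying various weighting systems to see whether the resulting average is stable may be sketched as follows. The relatives are those of Table 108 for 1920; the three weighting systems are hypothetical and are not drawn from the text.

```python
# Price relatives for 1920 (1910 = 100), from columns (f) to (j) of Table 108.
relatives_1920 = {
    "corn": 141.0, "oats": 137.2, "wheat": 163.4, "barley": 122.3, "rye": 178.7,
}

# HYPOTHETICAL weighting systems to be compared; not from the text.
weight_systems = {
    "equal": {"corn": 1, "oats": 1, "wheat": 1, "barley": 1, "rye": 1},
    "system_b": {"corn": 5, "oats": 2, "wheat": 3, "barley": 1, "rye": 1},
    "system_c": {"corn": 3, "oats": 3, "wheat": 3, "barley": 1, "rye": 1},
}


def weighted_average(relatives, weights):
    """Weighted arithmetic average of the relatives."""
    total_weight = sum(weights[name] for name in relatives)
    return sum(relatives[name] * weights[name] for name in relatives) / total_weight


results = {
    label: weighted_average(relatives_1920, weights)
    for label, weights in weight_systems.items()
}

# Range of the results: a small range suggests the average is stable.
spread = max(results.values()) - min(results.values())

for label, value in results.items():
    print(label, round(value, 2))
print("spread", round(spread, 2))
```

Here the three systems give results within about one point of each other. Had the spread been large, the choice of weights could not have been treated as a matter of indifference.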

As a matter of fact, almost every index number is weighted in 
one way or another. If one type of fluctuation is represented in 
the index number by a larger number of series than another type 
of fluctuation, there is concealed weighting in the construction 
of the index. Such concealed weighting is frequently present in 
index numbers which are stated to be unweighted. Furthermore, 
such concealed weighting is almost invariably bad weighting. 
Something akin to illogical weighting may result, in fact, even if 
there is only one series representing any particular commodity, 
for the different commodities may not be all of the same impor- 
tance despite the fact that they have been equally represented 
in the construction of the index. As Mitchell says,1 “the real
problem for the maker of index numbers is whether he shall leave 
weighting to chance or seek to rationalize it.” In general, 
weighting is to be regarded as a highly important step in the 
construction of most index numbers. It is true, as Bowley states, 
that it does not pay to “strain after exactness in weighting,”2
but some logical system of weighting is necessary unless it may be 


1 “Index Numbers of Wholesale Prices in the United States and Foreign Countries,” 
Bureau of Labor Statistics, Bulletin 173, p. 72. 
2 Elements of Statistics, p. 94. 
safely assumed that the divergent movements within the group of
related variables so neutralize one another that their net effect is 
to register the movement which is under investigation. 
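
Concealed weighting may be illustrated by a nominally unweighted index in which one type of fluctuation is represented by several series. The series, groupings, and relatives below are hypothetical.

```python
from collections import defaultdict

# HYPOTHETICAL series: (name, type of fluctuation represented, price relative).
series = [
    ("wheat", "grain", 120.0),
    ("flour", "grain", 122.0),
    ("bread", "grain", 118.0),
    ("cotton", "textile", 90.0),
]

# A nominally unweighted index: the simple average of all series.
simple_average = sum(relative for _, _, relative in series) / len(series)

# The concealed weights: the share of the series falling in each group.
counts = defaultdict(int)
for _, group, _ in series:
    counts[group] += 1
concealed_weights = {group: n / len(series) for group, n in counts.items()}

# For comparison: average within each group first, then give the groups equal weight.
group_values = defaultdict(list)
for _, group, relative in series:
    group_values[group].append(relative)
group_means = {g: sum(v) / len(v) for g, v in group_values.items()}
equal_group_average = sum(group_means.values()) / len(group_means)

print(simple_average)       # 112.5
print(concealed_weights)    # {'grain': 0.75, 'textile': 0.25}
print(equal_group_average)  # 105.0
```

The simple average gives the grain movement three times the influence of the textile movement solely because three grain series happened to be included, and not because of any judgment as to relative importance.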

The whole problem of weighting is one which will recur in 
connection with the discussion of the later phases of the technique 
of index-number construction. It is not a subject which may 
be safely neglected, as is sometimes asserted. On the contrary 
it is one of the most important problems in the making of index 
numbers. 

Index numbers should be stated finally in relative form; that is,
in a form which refers the magnitudes of one time or place to those
of some other time or place, used as a point of reference or comparison.1
Selection of the base point or base period is therefore
one step in the construction of the index number. Certain
phases of the problem of selecting the base period are best deferred 
until later in the discussion; but it should be noted that, in 
general, base points or base periods should represent normal 
conditions — normal, that is, in the sense of being at neither 
extreme of variation. Base points or periods normal in this 
sense may be had by striking an average of values represented in 
the area or period covered by the index numbers, or by selecting 
some specific point or period which is known not to be extreme 
from the point of view of the investigation in hand. The im- 
portance of having such a normal basing point lies largely in the 
assurance it gives of no grossly misleading impressions from the 
index number. But the selection of the base period as a rule does 
not involve as serious problems in the making of index numbers 
as do the other steps which have to be taken in index-number 
construction. In fact, some index numbers are published with 
no reference to any basing point whatsoever. 
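
Shifting an index from one base to another is a matter of simple proportion, as the following sketch suggests. The index values are those of column (k) of Table 108 for the years quoted in the text; the choice of 1914, and of the average of the five years, as alternative bases is made here only for illustration.

```python
# Index of cereal prices, 1910 = 100, from column (k) of Table 108.
index_1910_base = {1910: 100.0, 1914: 117.6, 1916: 168.4, 1918: 218.5, 1920: 148.5}


def rebase(series, base_value):
    """Restate every item of the series as a percentage of base_value."""
    return {year: 100.0 * value / base_value for year, value in series.items()}


# Base shifted to a single year (1914 = 100).
index_1914_base = rebase(index_1910_base, index_1910_base[1914])

# Base taken as the average of the years covered (average of the five years = 100).
period_average = sum(index_1910_base.values()) / len(index_1910_base)
index_average_base = rebase(index_1910_base, period_average)

print({year: round(value, 1) for year, value in index_1914_base.items()})
print({year: round(value, 1) for year, value in index_average_base.items()})
```

The relative movement between any two years is unaffected by the shift; only the level at which the series is stated is changed.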

The derivation and application of weighting systems, the 
selection of base points or periods, and the determination of the 
mathematical process by which to obtain index numbers are 


1 A year which is used as a basis of reference is called a base year; any other year, a given 
year. The difference between the two may be indicated in symbols by giving the base year
in a subscript before the index, and the given year in a subscript after the index; e.g. 91, 
where o refers to the base year, and 1 to the given year. 
problems of a distinctly technical character. Recent discussions
have thrown a great deal of light upon these problems and have 
provided solutions which now possess a degree of authority which 
is unusual in statistical technique. These will be considered in 
the two chapters that follow. 


CHAPTER XXII 


UNWEIGHTED INDEX NUMBERS 


One of the most fundamental questions encountered in the 
construction of an index number concerns the relative importance 
to be ascribed to the several individuals of the group in the de- 
termination of the change of the group as a whole. Sometimes 
the individual members of the group may all be regarded as of 
equal importance. The influence of no one of them is then to be 
multiplied in measuring the change of the group: the index is to 
be set up in unweighted form. At other times, some members of 
the group are to be regarded as more representative than others. 
The changes of these have then to be given correspondingly 
greater effect in the calculation of the change of the total group; 
the index, in other words, must be weighted. The distinction 
between weighted and unweighted forms is fundamental in the 
making of index numbers. Unweighted forms will be considered 
in the present chapter; weighted, in the next.

The problem of setting up in proper statistical form an un- 
weighted index number may be introduced by the consideration 
of a simple concrete case. Suppose that there is interest in the 
subject of loss of weight during hard contests by football players. 
A record is taken of the weight of each member of a particular 
team before and after an important intercollegiate game. The 
data for eight men playing through the entire contest appear in 
Table 113. The general object of inquiry may first be considered 
by putting a simple question: to what extent did the weights of 
these eight players undergo change relatively during the game? 

Measurement of the relative change in weight of the group of 
eight players as a whole may be obtained in at least three distinct 
figures. First, the relative change of the group may be stated 
in terms of the average relative change of the individuals of the
group. In column (c) of the table, the weights of the players 
after the game are stated as relatives of their weights before the 
game. The median of these relatives is 97. This figure ex- 
presses the characteristic change of weight in the group of eight 
players as a whole. An index number, then, may take the form
of an average of relatives.


TABLE 113. WEIGHTS OF MEMBERS OF FOOTBALL TEAM BEFORE
AND AFTER IMPORTANT CONTEST

       | WEIGHT (in pounds)        | WEIGHT AFTER GAME
PLAYER | BEFORE GAME | AFTER GAME  | RELATIVE TO WEIGHT
       |             |             | BEFORE GAME
       | (a)         | (b)         | (c)
A      | 172         | 167         | 97
B      | 187         | 180         | 96
C      | 198         | 192         | 97
D      | 189         | 183         | 97
E      | 210         | 201         | 96
F      | 182         | 178         | 98
G      | 165         | 160         | 97
H      | 166         | 163         | 98


Second, the relative change of the group may be found in the 
ratio of the average weight of the players before the game to their 
average weight after the game. In the data given in Table 113 
the average (arithmetic mean) weight of the men before the game 
is 183.6 pounds; after the game, 178 pounds. If the former 
figure is taken to be 100, the latter is 97. Here again, we have
an index number of the weight of the group. In this case, it is 
essentially a ratio of averages. 

Third, the relative change of the group may be ascertained from 
the ratio of the total weight of the players before the game to 
their total weight after the game. The aggregate weight of the 
men before the game was 1469 pounds; after the game, 1424. 
These two totals stand in the ratio of 100 to 97. The index
number here is found as a ratio of aggregates.


1 Index numbers derived from aggregates may be called, after Professor Fisher, “aggrega- 
tive” index numbers. 
Every index number takes one of these three fundamental
forms. Under certain conditions, the three forms result in 
equivalent index numbers. The matter of form is then im- 
material. Under most conditions, however, differences in form 
result in differences in magnitude in the index. It becomes 
highly important consequently to examine all three forms in 
order that the special characteristics of each may be definitely 
recognized. 
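
The three forms may be computed directly from the data of Table 113. The sketch below uses the median for the average of relatives, as in the text, but takes it over unrounded relatives; the variable names are illustrative only.

```python
from statistics import mean, median

# Weights in pounds for players A to H, from Table 113.
before = [172, 187, 198, 189, 210, 182, 165, 166]
after = [167, 180, 192, 183, 201, 178, 160, 163]

# Weight after the game as a percentage of weight before, player by player.
relatives = [100.0 * a / b for a, b in zip(after, before)]

# Form 1: an average of relatives (here, their median).
average_of_relatives = median(relatives)

# Form 2: a ratio of averages (arithmetic means).
ratio_of_averages = 100.0 * mean(after) / mean(before)

# Form 3: a ratio of aggregates (totals).
ratio_of_aggregates = 100.0 * sum(after) / sum(before)

print(round(average_of_relatives))  # 97
print(round(ratio_of_averages))     # 97
print(round(ratio_of_aggregates))   # 97
```

All three round to 97 for these data. The second and third forms agree exactly here, since each mean is merely the corresponding total divided by the same number of players; the first agrees only because the individual relatives happen to lie close together.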

Index numbers in the form of averages of relatives have been 
very widely employed. They possess an appearance of simple 
reasonableness which at first sight is quite convincing. What 
could be more obviously a satisfactory measure of the relative 
change of a group of variables than an average of the relative 
changes of the individuals composing the group? But the matter 
is not as simple as this line of reasoning would imply. That this 
is the case will appear if the nature of an average of relatives is 
more closely examined. 

Let the character of the individual relatives first be considered.
The most important fact about these is that they state each 
difference proportionately, not absolutely. The same relative 
is obtained for the football player weighing 200 pounds who 
loses four pounds, and the player weighing 150 pounds who loses 
three. Each difference is thus specifically related to the size of 
the individual exhibiting the difference. All members of the 
group, whatever their absolute differences of size, are re- 
duced by this means to a common level. From the point of 
view of proportionate change, the relatives are strictly com- 
parable items. They may appropriately be averaged provided 
changes among the individuals of the group are to be dealt with 
relatively.

But changes among the members of a group are not always to 
be thought of proportionately. As will later appear, conditions 
sometimes indicate that differences are to be treated absolutely. 
One of the most important questions in index-number construction
is: are the changes of the individuals of the group governed


1 Other forms are conceivable but not of practical significance. 
by a law of proportionality? For example, in the case of the
football players, may the man weighing 200 pounds be thought to 
lose his four pounds as easily as a man weighing 150 loses his 
three? If no such rule of proportionality can be thought to 
prevail, the development of relatives cannot be said to be posi- 
tively indicated. 

Even if individual relatives appear to be appropriate, the 
process of averaging the relatives involves further problems. 
In the first place, it is to be noted that the simple (i.e. unweighted)
arithmetic and harmonic means are not applicable.
This follows from the fact that in the averaging of relatives both 
of these means have a definite bias — that is, a persistent error 
of definite sign.1

The existence of this bias is easily demonstrated through the 
use of the base-reversal test.2 This test consists in (1) obtaining 
an average of the relatives on the one date (or place) as a base; 
(2) obtaining another average of the relatives on the other date 
(or place) as a base; and (3) multiplying these two averages 
together. The two averages should be reciprocals of one another 
since they are nothing but end-to-end views of one set of changes. 
The product of the averages, therefore, should equal unity.3
When the product for any given variety of average consistently 
exceeds (or falls short of) unity, that form of average may be 
said to display bias. 

When the base-reversal test is applied to a simple arithmetic 
mean of relatives, a definite upward bias is discovered. Consider 
the data given in Table 114. 

The arithmetic mean of the price relatives of 1920 on the 
base of 1913 is 269.6 per cent; the arithmetic mean of the 
relatives of 1913 on the base of 1920 is 46.1 per cent; the product
of the two is 124.3 per cent — or 1.24. Upward bias of
the sort disclosed in this example is always present in a simple 


1 Professor Fisher refers to such bias, arising from the form of the average, as type bias. 

2 This is commonly called, after Professor Fisher, the time-reversal test. It is, however,
just as relevant to different points in space as to different points in time. The designation 
used in the text would seem, therefore, to be preferable. 

3 In symbolic terms, the base-reversal test is met only when ${}_0P_1 \times {}_1P_0 = 1$.
arithmetic mean of relatives referring to different points in time
or space.1


TABLE 114. ANNUAL AVERAGE PRICES AND PRICE RELATIVES OF GROUP OF
NINE COMMODITIES, 1913 AND 1920


                       AVERAGE PRICE            PRICE RELATIVE
ARTICLE      UNIT     1913      1920     (1913 = 100)   (1920 = 100)

Rib roast    lb.      .198      .348        175.8          56.9
Ham          lb.      .269      .577        214.5          46.7
Eggs         doz.     .345      .536        155.4          64.3
Butter       lb.      .383      .672        175.4          57.0
Milk         qt.      .089      .162        182.0          55.0
Flour        lb.      .033      .088        266.7          37.5
Potatoes     lb.      .017      .103        605.9          16.5
Sugar        lb.      .055      .267        485.5          20.6
Coffee       lb.      .298      .492        165.1          60.5

A simple harmonic mean of relatives, on the other hand, always 
exhibits a downward bias. Let the base-reversal test be applied 
to a harmonic mean of the price relatives of Table 114. The 
product of the two means — one of relatives on the base of 1913, 
the other of relatives on the base of 1920 — is 80.6%, or .81.
The bias of the harmonic mean is thus downward and of the same
magnitude as the upward bias of the arithmetic.2 The type


1 Type bias sometimes leads to absurd results. Consider the following example: 


TABLE 115. ILLUSTRATION OF BIAS OF ARITHMETIC MEAN
OF PRICE RELATIVES


                AVERAGE PRICE                   PRICE RELATIVE
COMMODITY      1910      1920     1920 on base of 1910   1910 on base of 1920

A             $ .50     $ .40              80                     125
B               .25       .50             200                      50
C              1.00       .60              60                     167
                                        3)340                   3)342
                                          113                     114


There appears in this case to have been an increase of prices from whichever end the movement
is regarded!
2 It follows that a product of the two is without bias.
bias of both of these means renders them individually inapplicable
to the simple averaging of relatives. 
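
The base-reversal test for the two biased means may be sketched in Python as follows. This is an illustration only: the function and variable names are assumptions of the sketch, the prices are those of Table 114, and the relatives are carried as unrounded ratios, so the last digit may differ slightly from the printed figures.

```python
# Prices (dollars) of the nine commodities of Table 114 in 1913 and 1920.
prices_1913 = [0.198, 0.269, 0.345, 0.383, 0.089, 0.033, 0.017, 0.055, 0.298]
prices_1920 = [0.348, 0.577, 0.536, 0.672, 0.162, 0.088, 0.103, 0.267, 0.492]


def arithmetic_mean(values):
    return sum(values) / len(values)


def harmonic_mean(values):
    return len(values) / sum(1 / v for v in values)


# Relatives as plain ratios: forward uses 1913 as base, backward uses 1920.
forward = [p1 / p0 for p0, p1 in zip(prices_1913, prices_1920)]
backward = [p0 / p1 for p0, p1 in zip(prices_1913, prices_1920)]

# Base-reversal test: the product of the two averages should equal 1.
arithmetic_product = arithmetic_mean(forward) * arithmetic_mean(backward)
harmonic_product = harmonic_mean(forward) * harmonic_mean(backward)

print(f"arithmetic product: {arithmetic_product:.3f}")  # above 1: upward bias
print(f"harmonic product:   {harmonic_product:.3f}")    # below 1: downward bias
```

Because the harmonic mean of a set of relatives is the reciprocal of the arithmetic mean of the reversed relatives, the two products computed here are reciprocals of one another.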

Of the common calculated means of relatives, the geometric 
alone meets perfectly the base-reversal test. If a geometric 
mean is taken of the price relatives of Table 114, the figure for 
1920 on the base of 1913 is 237.7 per cent; for the earlier year 
on the base of the later, 42.1 per cent. The product of these two 
is unity. The geometric mean gives results devoid of bias what- 
ever the base upon which the relatives have been computed. In 
other words, an index number obtained as a geometric mean of 
relatives is undisturbed by changes of base point or period. It 
may be shifted from one base to another by dividing the index 
through by its value at the point or period which is to serve as 
the new base. The fact that the geometric mean is thus per- 
fectly clear of type-bias and freely reversible, gives it peculiar 
merit in index-number construction. 
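
The same test applied to the geometric mean may be sketched as follows, again on the prices of Table 114. The helper name `geometric_mean` is an assumption of this sketch, which averages logarithms and is not taken from the text.

```python
import math

# Prices (dollars) of the nine commodities of Table 114 in 1913 and 1920.
prices_1913 = [0.198, 0.269, 0.345, 0.383, 0.089, 0.033, 0.017, 0.055, 0.298]
prices_1920 = [0.348, 0.577, 0.536, 0.672, 0.162, 0.088, 0.103, 0.267, 0.492]


def geometric_mean(values):
    # Average the logarithms, then take the antilogarithm.
    return math.exp(sum(math.log(v) for v in values) / len(values))


forward = geometric_mean([p1 / p0 for p0, p1 in zip(prices_1913, prices_1920)])
backward = geometric_mean([p0 / p1 for p0, p1 in zip(prices_1913, prices_1920)])

print(f"1920 on base of 1913: {100 * forward:.1f}")
print(f"1913 on base of 1920: {100 * backward:.1f}")
print(f"product: {forward * backward:.6f}")
```

The product comes out at unity apart from floating-point rounding, because the logarithms of the reversed relatives are the negatives of the logarithms of the forward relatives.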

The geometric mean, however, is not the only average of relatives
to fulfill the base-reversal test: the two means of position —
the median and the mode — meet the base-reversal test just as 
completely. This follows inevitably from the fact that both of 
these means relate to relatives of a single item of the group. Thus, 
to refer again to the data of Table 114, the median relative for 
1920 on the base of 1913 is 182.0 per cent; for 1913 on the base 
of 1920 is 55.0 per cent. Since the median relatives have to 
do with a single member of the group — namely, the one exhibit- 
ing the median proportionate change — and merely express in 
percentage form the relationship between the actual prices of 
this single commodity at the two different dates, they are bound 
to be reciprocals of one another. The same line of analysis holds 
for the mode. In the simple averaging of relatives, both 
median and mode fulfill the base-reversal test perfectly. 
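
That the two median relatives belong to one and the same commodity can be verified with a short Python sketch. The names below are assumptions of the sketch; it relies on the group containing an odd number of items (nine), so that the median is an actual relative and not a midpoint between two.

```python
import statistics

# Articles and prices (dollars) of Table 114.
articles = ["Rib roast", "Ham", "Eggs", "Butter", "Milk",
            "Flour", "Potatoes", "Sugar", "Coffee"]
prices_1913 = [0.198, 0.269, 0.345, 0.383, 0.089, 0.033, 0.017, 0.055, 0.298]
prices_1920 = [0.348, 0.577, 0.536, 0.672, 0.162, 0.088, 0.103, 0.267, 0.492]

forward = [p1 / p0 for p0, p1 in zip(prices_1913, prices_1920)]   # 1913 as base
backward = [p0 / p1 for p0, p1 in zip(prices_1913, prices_1920)]  # 1920 as base

# With nine items the median is the relative of one actual commodity.
median_forward = statistics.median(forward)
median_backward = statistics.median(backward)
median_article = articles[forward.index(median_forward)]

print(median_article)
print(f"{100 * median_forward:.1f}", f"{100 * median_backward:.1f}")
print(f"product: {median_forward * median_backward:.6f}")
```

Reversing the base reverses the order of the relatives, so the middle item stays in the middle and its two relatives are reciprocals.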

But freedom from type bias is not the only ground upon which 
some means are to be preferred to others in the averaging of 
relatives. In an earlier connection it was pointed out that


1 It should be noted, however, that two or more biased averages may be so combined as
to obtain an average without bias. See previous note.
frequency distributions of relatives sometimes skew. This follows
in part from the very nature of certain relatives — noticeably 
extreme relatives — in part, from the fact that such relatives
under ordinary conditions cannot run more than 100 points below
the base — in other words, cannot become less than 0 — but may
run any number of points above. A tendency to skew is likely 
to be especially marked when relatives are computed for a period 
which is remote from the period used as a base. In some fre- 
quency distributions of price relatives, asymmetry is one of the 
most striking features of the record. This asymmetry has a 
definite bearing upon the selection of an average to typify the 
relatives. 

In general, asymmetry in a distribution suggests the use of the 
geometric mean or some average of position. Since the mode 
rarely is well enough defined to be serviceable, choice may be said 
to lie between the geometric mean and the median. Both of 
these means have their advocates.1 The median has the advantage
of being easily ascertained. It is undisturbed by extreme
items. It may be effectively used in graphic representations of 
the dispersion of the relatives.2 The geometric mean has the 
advantage of being undisturbed by moderate asymmetry in the 
distribution of the items. It has definite mathematical character 
and lends itself readily to further mathematical treatment. Fur- 
thermore, it is used with especial appropriateness in the averaging 
of relatives since by very nature it treats differences as equivalent 
when they are relatively, not absolutely, so. Altogether, the 
geometric mean would seem to possess somewhat greater virtues 
than the median. Both, however, are greatly to be preferred to 
any other means in the simple averaging of relatives. 

As previously indicated, an average of relatives is only one of 
the fundamental forms of index number. The ratio of averages 
is another. In the measurement of change of weight in the group
of eight football players this appeared to be an extremely clear 


1 The median is recommended by Professor Edgeworth, and has been employed in important
studies by Professor Mitchell.

2 It is also said by some authorities to be exceptionally free from fluctuations due to
sampling.




and simple form. Despite this fact, it is a form which need not 
be given further attention. This follows from two facts: (1) ra- 
tios of averages other than arithmetic means are of no con- 
siderable importance in index-number construction; (2) ratios 
of averages calculated as arithmetic means are identical with 
ratios of aggregates.1 For the purposes of the present study,
therefore, the consideration of index numbers in the form of 
ratios of averages may be merged in a comprehensive examina- 
tion of index numbers in the form of ratios of aggregates. 

The most significant difference between the ratio of aggre- 
gates and the average of relatives lies in the fact that the ratio 
of aggregates treats individual differences absolutely. Consider 
the simple data of Table 116. 


TABLE 116. AVERAGE PRICES AND PRICE RELATIVES OF THREE COMMON
ARTICLES OF CONSUMPTION, YEARS A AND B


                        PRICE             RELATIVE OF PRICE IN B
ARTICLE    UNIT     Year A    Year B      ON BASE OF PRICE IN A

Sugar      lb.      $ .10     $ .05                50
Milk       qt.        .15       .15               100
Flour      bbl.      5.00     10.00               200


The geometric mean of the three price relatives for the year B
on the base of the year A is 100: the relative decline in the price
of sugar is exactly offset by the relative advance in the price of
flour. The ratio of aggregates, on the other hand, is $\frac{10.20}{5.25}$,
or 194, for the year B on the base of 100 for the year A: in this
form the much larger absolute increase in the price of a barrel
of flour completely overshadows the smaller absolute decrease in
the price of a pound of sugar. A ratio of aggregates allows
absolute differences to take full effect on the index number.
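
The contrast between the two forms can be reproduced from the three prices of Table 116. The Python below is a minimal sketch with illustrative names; the milk price of $.15 in both years is the value implied by the aggregates 5.25 and 10.20 quoted in the text and is treated here as an assumption.

```python
import math

# (article, price in year A, price in year B), in dollars per unit.
# The milk price is inferred from the quoted aggregates 5.25 and 10.20.
prices = [("Sugar", 0.10, 0.05), ("Milk", 0.15, 0.15), ("Flour", 5.00, 10.00)]

# Average of relatives (geometric): each article counts by its proportionate change.
relatives = [b / a for _, a, b in prices]
geometric_mean_of_relatives = 100 * math.prod(relatives) ** (1 / len(relatives))

# Ratio of aggregates: each article counts by its absolute price.
total_a = sum(a for _, a, _ in prices)
total_b = sum(b for _, _, b in prices)
ratio_of_aggregates = 100 * total_b / total_a

print(round(geometric_mean_of_relatives), round(ratio_of_aggregates))
```

The halving of the sugar price and the doubling of the flour price cancel in the first measure, while in the second the five-dollar rise in flour dominates the five-cent fall in sugar.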


1 Clearly, the only difference in the two computations is that in obtaining the ratio of 
averages both numerator and denominator of the ratio of aggregates are divided by the 
number of individuals in the group. This, of course, does not affect the ratio.
This leads to results which may, or may not, be desirable. All
depends upon the way in which differences are to be viewed in 
the particular inquiry in hand. If we are to regard the price- 
changes from the point of view of a person compelled to purchase 
a consignment of goods in the two years, absolute differences are 
of the utmost significance. If, on the other hand, we are to 
think of price-changes as individual indicia of an underlying 
monetary influence affecting the value of money, it may be preferable
to think of differences in relative terms.1 Ratios of aggregates
and averages of relatives both have their places in index-number
construction. The important thing is to recognize the
essential difference in the nature of the two forms. 

The use of an index number — whether it be in the form of a 
ratio of aggregates or an average of relatives — always involves 
questions of representativeness. As already stated, every index 
number is expected to register relative differences in a group of 
variables. Sometimes the group, the change of which is to be 
measured by the index, is coextensive with the group actually 
recorded in the data; in this case, the requirement of representa- 
tiveness is manifestly fulfilled. More frequently, the individuals 
of the group covered by the actual data can only be regarded as 
specimens of larger masses of variables lying behind the records.2
The question then arises: to what extent are the individuals in 
hand a satisfactory representation of the whole mass? No 
question more difficult than that of representativeness arises in 
the entire technique of index-number construction. 

Sometimes the fulfillment of the requirement of representative- 
ness may be left to chance. This may be done, of course, if the 

1 From this point of view, the price of each commodity may be thought of as affected in 
some measure by a pervasive monetary influence. It may be supposed that some prices 
will be affected more than others, but that, in general, there will be an observable tendency 
for all to be affected by this general influence in about the same measure relatively. An 
average of relatives would then seem to be in reasonable order. (See A. J. Flux, Statistical 
Journal, August, 1907, p. 620, and March, 1921, p. 175.) Whether this possible distinction 
is actually to be made in the analysis of the value of money is not altogether clear, 
but an analogous distinction certainly is to be made in some lines of statistical inquiry. 

2 Thus the eight players whose records appear in Table 113 may be considered as representatives
of the whole company of football players. They must be so regarded if the
inquiry relates to loss of weight by football players in general, and not by this particular
group of eight players.
members of the group have been brought together under conditions
of dependable sampling. Ordinarily, however, the composition
of the group cannot be safely left to random selection.
The larger mass of variables is ordinarily known to consist of 
distinct groups of variables each with its characteristic move- 
ment. Under these conditions, haphazard weighting is bound 
to be illogical weighting. Proportional representation has to be 
explicitly provided for; in other words, carefully selected weights 
have to be introduced in deriving the index number. The signifi- 
cance of this in the making of index numbers will appear in the 
next chapter. 


CHAPTER XXIII 


WEIGHTED INDEX NUMBERS 


The problem of constructing a weighted index number may be
first considered under somewhat simplified conditions. Suppose, 
in the first place, that the available data are either absolutely 
complete or so abundant as to free the construction of the index 
number from all restrictions due to lack of data. Suppose, in 
the second place, that the grouped variables to be measured 
relate to a single pair of dates or places. Conditions such as these 
sometimes obtain in index-number construction. They simplify 
substantially the general problem of determining the type of 
index to be employed. 

Perhaps the most rudimentary type of weighted index num- 
ber is one derived from a ratio of weighted averages. Suppose 
the problem is to compute an index number of changes in the
farm prices of five leading cereal crops in the United States
for the dates 1914 and 1919. The data are given in Table 117.


TABLE 117. AVERAGE FARM PRICE ON DECEMBER 1, AND TOTAL PRODUCTION
OF FIVE CEREAL CROPS IN THE UNITED STATES, 1914 AND 1919


                 PRICE              PRODUCTION          AGGREGATE VALUE
           (cents per bushel)    (million bushels)     (million dollars)
CEREAL       1914      1919       1914      1919        1914       1919

Corn         64.4     134.7       2673      2811      1721.4     3786.4
Wheat        98.6     215.1        891       941       878.5     2024.1
Oats         43.8      71.5       1141      1184       499.8      846.6
Barley       54.3     121.0        195       166       105.9      200.9
Rye          86.5     134.5         43        88        37.2      118.4
Total       347.6     676.8       4943      5190      3242.8     6976.4


An index number in this case may take the form of a ratio of the 
average prices per bushel on December 1, 1914 and 1919, when
all five crops are thrown together. These average prices are easily
obtained, of course, by dividing the aggregate values of the five 
crops on these two dates by the total number of bushels ex- 
changed. The figures for 1914 and 1919 are 3242.8/4943, or 
65.6 cents per bushel, and 6976.4/5190, or 134.4 cents per bushel, 
respectively. The index number for 1919 on the base of 1914 =
100 is therefore 205; the index number for 1914 on the base of
1919 = 100 is 49. Under the conditions assumed, with all the
constituent series expressed in the same unit, the computation 
of an index number in this form is a simple and straightforward 
matter. 
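
The computation just described can be retraced in a few lines of Python. The sketch below is illustrative only: the names are assumptions, and the prices and production figures are those read from Table 117.

```python
# (cereal, price 1914, price 1919 in cents per bushel,
#  production 1914, production 1919 in million bushels)
crops = [
    ("Corn", 64.4, 134.7, 2673, 2811),
    ("Wheat", 98.6, 215.1, 891, 941),
    ("Oats", 43.8, 71.5, 1141, 1184),
    ("Barley", 54.3, 121.0, 195, 166),
    ("Rye", 86.5, 134.5, 43, 88),
]

# Aggregate value (cents x million bushels) and total bushels in each year.
value_1914 = sum(p14 * q14 for _, p14, _, q14, _ in crops)
value_1919 = sum(p19 * q19 for _, _, p19, _, q19 in crops)
bushels_1914 = sum(q14 for _, _, _, q14, _ in crops)
bushels_1919 = sum(q19 for _, _, _, _, q19 in crops)

# Weighted average price per bushel, all five crops thrown together.
average_1914 = value_1914 / bushels_1914
average_1919 = value_1919 / bushels_1919

# Ratio of weighted averages, on either year as base.
index_1919 = 100 * average_1919 / average_1914
index_1914 = 100 * average_1914 / average_1919

print(f"{average_1914:.1f} {average_1919:.1f} {index_1919:.0f} {index_1914:.0f}")
```

Each year's average is weighted by that year's own production, so the index reflects the shift in the crop mix as well as the change in prices.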

But this form of index, simple as it may appear to be, may not 
be appropriate. It may not be in accord with the purpose of the 
index to consider a bushel of oats of the same importance as a 
bushel of wheat. And when diverse units appear among the 
constituent series — as in a list of food stuffs such as bread, milk, 
butter, meat, eggs — the computation of such a ratio of averages 
rarely, if ever, yields useful results. Other forms must ordinarily 
be sought. 

The issues involved in this general problem of developing the 
best possible form of weighted index may be profitably raised in 
the light of concrete data. Simple cost-of-living data may be 
used for this purpose. Representative items for the years 1913 
and 1920 are given in Table 118. Let it be supposed that data 
of this kind cover the complete details of prices and quantities, 
and that the problem is to measure the relative cost of living in 
the two years 1913 and 1920. What forms are to be adopted in 
developing under these conditions an appropriate index number 
of the cost of living? 

At least two distinct varieties of index number are to be con- 
sidered. One variety is designed to measure changes in the cost 
of a fixed or standard budget of articles. In this type of index,
the p’s (prices) vary from one period of time to the other, but 
the q’s (quantities) do not. The other variety of index is intended
to measure changes in the budget of commodities actually consumed,
changes occurring in both the quantities consumed and
the prices paid. In other words, in this second type of index,
both p’s and q’s change from one period of time to the other. 


TABLE 118. AVERAGE MONTHLY CONSUMPTION AND MONTHLY AVERAGE PRICES
OF FOODSTUFFS CONSUMED BY TYPICAL FAMILY IN 1913 AND 1920*


                             CONSUMPTION            PRICE
ARTICLE OF FOOD    UNIT     1913     1920      1913      1920
      (a)           (b)      (c)      (d)       (e)       (f)

Sirloin steak      lb.        32       25     $.254     $.461
Round steak        lb.        32       35      .223      .426
Rib roast          lb.        31       27      .198      .348
Chuck roast        lb.        31       35      .160      .278
Pork chops         lb.        36       30      .210      .408
Bacon              lb.        17       21      .270      .539
Ham                lb.        22       30      .269      .577
Lard               lb.        34       30      .158      .293
Hens               lb.        23       31      .213      .460
Eggs               doz.       61       60      .345      .536
Butter             lb.        66       54      .383      .672
Cheese             lb.        12       15      .221      .418
Milk               qt.       337      370      .089      .162
Bread              lb.       531      590      .056      .118
Flour              lb.       264      275      .033      .088
Rice               lb.        35       35      .087      .187
Potatoes           lb.       704      700      .017      .103
Sugar              lb.       147      170      .055      .267
Coffee             lb.        40       51      .298      .492
Tea                lb.         8        7      .544      .741


* Data such as are shown in this table may be conveniently represented in symbolic form. Thus,
the prices of any single year may be expressed in a series of p’s — $p', p'', p''', p^{iv}, p^{v}$; the quantities
of the same year by $q', q'', q''', q^{iv}, q^{v}$. The prices and quantities of different years may be distinguished
by subscripts; thus the price of an article in the base year may be shown by $p'_0$; the
price of the same article in a later given year as $p'_1$; and similarly with quantities. Index numbers
of prices and quantities may be represented by ${}_0P_1$ and ${}_0Q_1$, where the subscript before the letter
refers to the base year, and the subscript after the letter to the given year.

Both of these varieties of index number are important. They 
merely serve somewhat different purposes. If an industrial 
concern is endeavoring to adjust wages to changes in the cost of 
living, the firm may be primarily interested in an index number of 
prices for a fixed or standard budget. On the other hand, some 
social relief agency may be primarily interested in the changes in 
actual expenditures for food rather than in the estimated cost of
a fixed food budget. Both varieties of index number therefore
have their uses; both are to be carefully examined. However,
attention may well be directed first to the simpler form in which 
the list of articles and the quantities consumed remain unchanged. 

The most direct way in which to secure an index number 
measuring the change in the cost of living of a fixed budget is 
to multiply the quantities of the commodities included in this 
budget by the prices at one date and then by the prices at the 
other date. The aggregate sums expended at the two different 
dates can then be readily converted to index-number form by 
calling either one a base equal to 100, and expressing the other
as a percentage of this. 

Calculation of an index number by this simple direct method
requires no extended explanation. The steps to be taken are
shown in full in Table 119.

It is to be noted that, in contrast to the cases considered in the 
preceding chapter, the price changes which determine the index 
number in the case of this cost-of-living index do not all possess 
equal influence. Some of the prices are multiplied by much 
larger quantities than are others. Here is the factor of weighting 
in index-number construction. Its importance can be readily 
perceived. An aggregate obtained by merely adding the various 
prices (per unit) in 1913 and comparing this total with the 
corresponding total of the several prices (per unit) in 1920 would 
have virtually no significance, for the individual prices relate to 
articles of widely differing importance in household consumption, 
and furthermore, are prices which attach to units that occur in 
household expenditures with widely differing frequencies. Some 
system of weighting which takes into account the quantities which 
occur in the budget is necessary if results are to represent actual 
costs of living. 
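
The effect of weighting can be seen by setting the cost of the fixed 1913 budget beside a bare sum of unit prices. The Python sketch below is an illustration under stated assumptions: the list names are invented for the purpose, and the figures are the consumption and price data of Table 118, listed in the order of that table.

```python
# Table 118 data, articles in table order (sirloin steak ... tea).
quantity_1913 = [32, 32, 31, 31, 36, 17, 22, 34, 23, 61,
                 66, 12, 337, 531, 264, 35, 704, 147, 40, 8]
price_1913 = [0.254, 0.223, 0.198, 0.160, 0.210, 0.270, 0.269, 0.158, 0.213, 0.345,
              0.383, 0.221, 0.089, 0.056, 0.033, 0.087, 0.017, 0.055, 0.298, 0.544]
price_1920 = [0.461, 0.426, 0.348, 0.278, 0.408, 0.539, 0.577, 0.293, 0.460, 0.536,
              0.672, 0.418, 0.162, 0.118, 0.088, 0.187, 0.103, 0.267, 0.492, 0.741]

# Weighted: cost of the fixed 1913 budget at the prices of each year.
cost_1913 = sum(q * p for q, p in zip(quantity_1913, price_1913))
cost_1920 = sum(q * p for q, p in zip(quantity_1913, price_1920))
weighted_index = 100 * cost_1920 / cost_1913

# Unweighted: ratio of the bare sums of unit prices, ignoring quantities.
unweighted_index = 100 * sum(price_1920) / sum(price_1913)

print(f"budget cost: {cost_1913:.3f} -> {cost_1920:.3f}")
print(f"weighted index: {weighted_index:.1f}")
print(f"unweighted index: {unweighted_index:.1f}")
```

The unweighted figure lets a pound of tea count for as much as a pound of potatoes, although the budget contains 704 pounds of the one and 8 of the other; the weighted figure gives the sharp rise in potatoes and sugar its proper influence.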

In obtaining the cost-of-living index just developed, it was 
assumed that the problem was to measure the relative cost in 1913 
and 1920 of a fixed budget, this budget consisting of the commod- 
ities and quantities consumed in 1913. But the problem may be 
different from this. It may involve, rather, the measurement of 
the relative cost of food in the two years when allowance is made
for any changes in the quantities consumed. Just what alterations
of the method of obtaining the index are necessitated by
this change of purpose?


TABLE 119. CALCULATION OF INDEX NUMBER OF COST OF LIVING
FOR FIXED BILL OF GOODS, 1913 AND 1920


(Aggregative Form)


                                                          EXPENDITURE FOR 1913
                         CONSUMPTION        PRICE         QUANTITY AT PRICE OF
ARTICLE          UNIT       1913        1913     1920       1913        1920

Sirloin steak    lb.         32        $.254    $.461     $ 8.128     $14.752
Round steak      lb.         32         .223     .426       7.136      13.632
Rib roast        lb.         31         .198     .348       6.138      10.788
Chuck roast      lb.         31         .160     .278       4.960       8.618
Pork chops       lb.         36         .210     .408       7.560      14.688
Bacon            lb.         17         .270     .539       4.590       9.163
Ham              lb.         22         .269     .577       5.918      12.694
Lard             lb.         34         .158     .293       5.372       9.962
Hens             lb.         23         .213     .460       4.899      10.580
Eggs             doz.        61         .345     .536      21.045      32.696
Butter           lb.         66         .383     .672      25.278      44.352
Cheese           lb.         12         .221     .418       2.652       5.016
Milk             qt.        337         .089     .162      29.993      54.594
Bread            lb.        531         .056     .118      29.736      62.658
Flour            lb.        264         .033     .088       8.712      23.232
Rice             lb.         35         .087     .187       3.045       6.545
Potatoes         lb.        704         .017     .103      11.968      72.512
Sugar            lb.        147         .055     .267       8.085      39.249
Coffee           lb.         40         .298     .492      11.920      19.680
Tea              lb.          8         .544     .741       4.352       5.928
                                                          211.487     471.339

The index on the base 1913 = 100 is, therefore, $\frac{471.339}{211.487}$, or 223.*

* In the notation previously explained this aggregative index number is represented as follows:

$${}_0P_1 = \frac{\sum p_1 q_0}{\sum p_0 q_0}$$


With the idealized conditions under which the inquiry is as- 
sumed to be conducted, the change of purpose does not entail 
any substantial increase in the labor of obtaining the index. It 
does involve, however, the introduction of 1920 quantities in 
addition to the 1913 quantities used exclusively in the
computation of the index developed above. Determination of the actual
budgets of 1913 and 1920 is easily accomplished by multiplying 
the quantities consumed in these two years by the prices paid at 
the time. Thus, 1913 quantities are to be multiplied by 1913 
prices to obtain the actual expenditures in that year. Similarly, 
1920 quantities are to be multiplied by 1920 prices. The aggre- 
gate expenditures in the two years are obtained very simply in 
this fashion. Reduced to relative form, these give the desired 
index.1 The computation in tabular form is as follows:


TABLE 120. CALCULATION OF INDEX NUMBER OF COST OF LIVING
FOR VARIABLE BILL OF GOODS, 1913 AND 1920


(Aggregative Form)


                                    1913                            1920
ARTICLE          UNIT   Quantity   Price   Expenditure   Quantity   Price   Expenditure

Sirloin steak    lb.       32      $.254     $ 8.128        25      $.461     $11.525
Round steak      lb.       32       .223       7.136        35       .426      14.910
Rib roast        lb.       31       .198       6.138        27       .348       9.396
Chuck roast      lb.       31       .160       4.960        35       .278       9.730
Pork chops       lb.       36       .210       7.560        30       .408      12.240
Bacon            lb.       17       .270       4.590        21       .539      11.319
Ham              lb.       22       .269       5.918        30       .577      17.310
Lard             lb.       34       .158       5.372        30       .293       8.790
Hens             lb.       23       .213       4.899        31       .460      14.260
Eggs             doz.      61       .345      21.045        60       .536      32.160
Butter           lb.       66       .383      25.278        54       .672      36.288
Cheese           lb.       12       .221       2.652        15       .418       6.270
Milk             qt.      337       .089      29.993       370       .162      59.940
Bread            lb.      531       .056      29.736       590       .118      69.620
Flour            lb.      264       .033       8.712       275       .088      24.200
Rice             lb.       35       .087       3.045        35       .187       6.545
Potatoes         lb.      704       .017      11.968       700       .103      72.100
Sugar            lb.      147       .055       8.085       170       .267      45.390
Coffee           lb.       40       .298      11.920        51       .492      25.092
Tea              lb.        8       .544       4.352         7       .741       5.187

                                             211.487                          492.272


The index for 1920 on the base of 1913 = 100 is $\frac{492.272}{211.487}$, or 233.


1 The index number here is given by the expression ${}_0V_1 = \frac{\sum p_1 q_1}{\sum p_0 q_0}$.
While the index number in this form measures accurately the
change in the actual expenditure for foodstuffs in these two years, 
it does not give us information concerning the extent to which the 
change in the total expenditures is due on the one hand to changes 
in the prices of the different articles purchased, and on the other 
hand to changes in the quantities consumed. These two sets of 
variables unite to effect the change in total expenditures. For 
many purposes, it is desirable to be able to distinguish the 
effects of the variable prices and variable quantities; in other 
words, to obtain separate and consistent indices of prices and 
of quantities. In fact, it is this problem of differentiation 
which is in some ways the most difficult problem in index- 
number construction. 

The form of index number which seems to serve best when both 
prices and quantities are variable is a compromise between two 
ratios of aggregates. Consider the change of prices from 1913
to 1920 in the light of the data of Table 120. If the quantities
of 1913 are taken to be representative (that is, are employed as 
weights), a measure of the change of prices is to be found in a 
ratio of the aggregates obtained by multiplying, first, the prices 
of 1920, and second, the prices of 1913, by these 1913 quantities.1
The index for 1920 on this basis has been found to be 223. But 
if the quantities of 1920 are taken to be representative, the 
measure of price-change becomes a ratio of aggregates calculated 
by multiplying, first, the prices of 1920, and second, the prices of 
1913, by the 1920 quantities.2 The index for 1920 on the base of
1913 is, in this case, 224. Ordinarily there is no reason for 
regarding either the 1913 or the 1920 set of quantities as more 
representative than the other, consequently a mean proportional 
is taken of the two ratios of aggregates obtained by using the two 
different sets of weights. The price index for 1920 on the base of
1913 is, therefore,

$$100\sqrt{\frac{471.339}{211.487} \times \frac{492.272}{219.537}} = \sqrt{222.9 \times 224.2} = 223.5.$$


1 The ratio in symbolic form is $\frac{\sum p_1 q_0}{\sum p_0 q_0}$.

2 Here the ratio in symbolic form is $\frac{\sum p_1 q_1}{\sum p_0 q_1}$.
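
The compromise between the two ratios of aggregates may be sketched in Python as follows. The names are assumptions of the sketch; the data are the quantities and prices of Table 120 in table order, and the results are carried unrounded, so they may differ from the printed figures in the last digit.

```python
import math

# Table 120 data, articles in table order (sirloin steak ... tea).
q0 = [32, 32, 31, 31, 36, 17, 22, 34, 23, 61,
      66, 12, 337, 531, 264, 35, 704, 147, 40, 8]          # 1913 quantities
q1 = [25, 35, 27, 35, 30, 21, 30, 30, 31, 60,
      54, 15, 370, 590, 275, 35, 700, 170, 51, 7]          # 1920 quantities
p0 = [0.254, 0.223, 0.198, 0.160, 0.210, 0.270, 0.269, 0.158, 0.213, 0.345,
      0.383, 0.221, 0.089, 0.056, 0.033, 0.087, 0.017, 0.055, 0.298, 0.544]
p1 = [0.461, 0.426, 0.348, 0.278, 0.408, 0.539, 0.577, 0.293, 0.460, 0.536,
      0.672, 0.418, 0.162, 0.118, 0.088, 0.187, 0.103, 0.267, 0.492, 0.741]


def aggregate(prices, quantities):
    # Sum of price x quantity over all articles.
    return sum(p * q for p, q in zip(prices, quantities))


# Ratio of aggregates with 1913 quantities as weights.
index_1913_weights = 100 * aggregate(p1, q0) / aggregate(p0, q0)
# Ratio of aggregates with 1920 quantities as weights.
index_1920_weights = 100 * aggregate(p1, q1) / aggregate(p0, q1)
# Mean proportional (geometric mean) of the two ratios.
compromise_index = math.sqrt(index_1913_weights * index_1920_weights)

print(f"{index_1913_weights:.1f} {index_1920_weights:.1f} {compromise_index:.1f}")
```

The geometric mean always lies between the two ratios, so the compromise index cannot stray outside the limits set by the two sets of weights.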
In the customary symbols, the index is given by the expression,

$${}_0P_1 = \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}}.$$

The index is clearly a geometric mean of
two ratios of aggregates. It is commonly referred to as the
“ideal” form.1

Index numbers in this standard form fulfill unmistakably two
important tests of index-number validity. Reference has already
been made to the base-reversal test. In the preceding chapter
it was pointed out that certain index numbers in the form of
averages of relatives fail to meet this test. But no aggregative
index numbers can be similarly defective. Since all aggregative
indices are converted to relative form only after the aggregates
have been obtained, it follows that any year in the series of aggregates
may be made the base year, and that the base may be
readily transposed from one year to another. The standard or
“ideal” form of index, being aggregative in character, meets
perfectly the base-reversal test.

A second test of validity — a test strongly urged by Professor
Irving Fisher — is the factor-reversal test. This test rests upon
the notion that the great bulk of index numbers relate to groups
of variables which are associated with other groups in the constitution
of more inclusive factors; just as prices and quantities
constitute two sets of variables which when taken together yield
the variable total expenditures in the cost-of-living analysis.
When this condition obtains, it may be argued that any index of
the change in the one element — say, prices — should be so set
up that a corresponding index of the change in the other element
— say, quantities — is to be secured by merely reversing the
factors in the formula; that is, by merely interchanging the p’s
and the q’s.


1 This is the index to which Professor Irving Fisher has brought such effective support.
It has the approval also of such other eminent authorities as Bowley, Walsh, and Young.

The compromise of weights which is effected in the “ideal” form may be secured by
averaging the quantities before developing the aggregates, thus:

$${}_0P_1 = \frac{\sum \tfrac{1}{2}(q_0 + q_1)\,p_1}{\sum \tfrac{1}{2}(q_0 + q_1)\,p_0} = \frac{\sum (q_0 + q_1)\,p_1}{\sum (q_0 + q_1)\,p_0}$$

or

$${}_0P_1 = \frac{\sum \sqrt{q_0 q_1}\,p_1}{\sum \sqrt{q_0 q_1}\,p_0}$$

Both of these forms are to be regarded as variants of the “ideal” formula.

Under the factor-reversal test, the two analogous indexes (say,
of prices and quantities) thus obtained must be consistent
with the index of that third more inclusive factor (total
expenditures, values, exchange, or some like factor) which is
constituted of the prices and quantities together.

The “ideal” formula meets the factor-reversal test perfectly.
When the price and quantity indexes are multiplied together, the
result is identical with that obtained by calculating the value
index directly. The “ideal” price index for 1920 on the base of
1913 has been found to be 223.5. The corresponding quantity
index is 104.1. The product of the two (2.235 × 1.041 = 2.327) is
232.7, which is the figure obtained for the value index for 1920
on the base of 1913.²
The “ideal” formula thus fulfills completely the factor-reversal
test.
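
The two tests may be checked numerically. What follows is a minimal sketch in Python, not the author's own computation: the prices and quantities are hypothetical, and the function names `aggregate` and `ideal_index` are introduced here for illustration only. It forms the “ideal” price index, obtains the quantity index by interchanging the p’s and the q’s, and compares the product with the value index; it also reverses the base.

```python
from math import sqrt, isclose

def aggregate(prices, quantities):
    """Sum of price times quantity over all commodities."""
    return sum(p * q for p, q in zip(prices, quantities))

def ideal_index(p0, q0, p1, q1):
    """Geometric mean of the two ratios of aggregates (base-year and given-year weights)."""
    base_weighted = aggregate(p1, q0) / aggregate(p0, q0)
    given_weighted = aggregate(p1, q1) / aggregate(p0, q1)
    return sqrt(base_weighted * given_weighted)

# Hypothetical data for three commodities (assumed for illustration only)
p0 = [1.00, 2.50, 0.40]   # base-year prices
q0 = [10.0, 4.0, 25.0]    # base-year quantities
p1 = [2.10, 5.80, 0.95]   # given-year prices
q1 = [11.0, 3.5, 27.0]    # given-year quantities

price_index = ideal_index(p0, q0, p1, q1)
quantity_index = ideal_index(q0, p0, q1, p1)   # p's and q's interchanged
value_index = aggregate(p1, q1) / aggregate(p0, q0)

# Factor-reversal test: price index times quantity index equals value index.
factor_reversal_ok = isclose(price_index * quantity_index, value_index)

# Base-reversal test: the index on the reversed base is the reciprocal.
base_reversal_ok = isclose(price_index * ideal_index(p1, q1, p0, q0), 1.0)

# The figures quoted in the text: 223.5 and 104.1 (per cent) give 232.7.
book_product = round(223.5 * 104.1 / 100, 1)
```

Both checks hold for any positive data, since the cross terms cancel under the square root.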

Under certain conditions, the “ideal” form may be fairly
regarded as the standard type of index number. These conditions 
may be stated as follows: (1) the existence of two sets of variables 
— such as prices and quantities — combining to form a third — 
such as values; (2) the availability of complete — in any case 
ample — data as to both sets of variables; (3) the limitation of 
the measurement of change to a single pair of dates or places. 
When these conditions prevail, the use of the “ideal” formula is 
clearly indicated. 

But these conditions are not always fulfilled. In the preceding 
chapter, cases were considered in which only one set of variables 
was involved, and in which index-number forms other than the
“ideal” were applicable. There remain to be considered certain
complications introduced by (1) inadequacies of data, and (2) the
necessity of making the index serve for a series of points in time
or space, instead of only two. The latter of these conditions may
be examined first.

1 When this is done, the “ideal” index for quantities is

$${}_0Q_1 = \sqrt{\frac{\sum q_1 p_0}{\sum q_0 p_0} \times \frac{\sum q_1 p_1}{\sum q_0 p_1}}$$

2 That this is bound to be the case is easily demonstrated algebraically:

$${}_0P_1 \times {}_0Q_1 = \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}} \times \sqrt{\frac{\sum q_1 p_0}{\sum q_0 p_0} \times \frac{\sum q_1 p_1}{\sum q_0 p_1}} = \frac{\sum p_1 q_1}{\sum p_0 q_0} = {}_0V_1.$$

Consider the problem of developing an index for a considerable 
series of years, the complete set of index numbers to give an 
accurate picture of the group movement for the period as a whole 
as well as for any two or more years within the period. For 
instance, an index of prices is sought not for the two years 1914 
and 1924, but for all the years from 1914 to 1924. The ideal form
developed in the previous chapter would require the computation 
of an index number for each possible pair of years in this series of 
eleven years, or a total of 55 index numbers. But to urge the 
necessity of such computation is to offer a counsel of perfection. 
What is needed in actual practice is an index which, computed 
but once for each year of the full period, permits without serious 
error a comparison of any two years. 
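
The figure of 55 is simply the number of distinct pairs among eleven years. As a brief worked check (the symbol $n$ for the number of years is introduced here for illustration only):

```latex
\[
\binom{n}{2} = \frac{n(n-1)}{2}, \qquad
\binom{11}{2} = \frac{11 \times 10}{2} = 55 .
\]
```

A single series computed once for each year on one base calls, by contrast, for only $n - 1 = 10$ computations beyond the base year itself.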

The problem may be approached in a somewhat different way. 
Suppose an index number is computed by the “ideal” formula for
each successive pair of years from 1914 to 1924; thus for 1915 
on the base of 1914, 1916 on the base of 1915, and so on, to the 
index number for 1924 on the base of 1923. Suppose these 
individual index numbers, first obtained as a series of year-to- 
year links, are welded together so as to make a chain for the full 
period of eleven years. The question may then be raised: how 
far is the result thus obtained for the year 1924 on the base of the 
year 1914 consistent with the result obtained by comparing 1924 
with 1914 by direct application of the “ideal” formula for these 
two years? 

The issue may be stated in still another form. Suppose at the
end of the period of eleven years, 1914-1924, an additional
theoretical year is added for which the data are identical with those
of 1914. Suppose that the prices of this year on the base of 1924
are computed by the “ideal” formula and this index is welded on
to the chain for the period 1914-1924. Will the final index for
the added year be identical with the index for the initial base
year 1914, for which all the data are the same, and for which
consequently the index number should be the same? This test
of index numbers is known as the circular test. It is, obviously,
entirely objective and readily applicable as soon as more than
two dates are covered. Any index number meeting the circular
test will give for a final year in which the data are identical with
the initial year, a result identical with the index number of the
initial year. Failure to meet the test would seem to raise serious
questions regarding the validity of the index as a measure of
the course of events through the full period.¹
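
The comparison of a welded chain with a direct index, and the circular test itself, may be sketched as follows. This is an illustrative Python sketch under stated assumptions: the three years of prices and quantities are hypothetical, and the names `ideal_index` and `chain` are introduced here only for the example.

```python
from math import sqrt, prod, isclose

def aggregate(prices, quantities):
    """Sum of price times quantity over all commodities."""
    return sum(p * q for p, q in zip(prices, quantities))

def ideal_index(p0, q0, p1, q1):
    """'Ideal' index of period 1 on period 0 (geometric mean of two ratios of aggregates)."""
    return sqrt((aggregate(p1, q0) / aggregate(p0, q0)) *
                (aggregate(p1, q1) / aggregate(p0, q1)))

def chain(series):
    """Weld successive links into a single index on the first period as base."""
    links = [ideal_index(*series[i], *series[i + 1]) for i in range(len(series) - 1)]
    return prod(links)

# Hypothetical (prices, quantities) for two commodities in three successive years
years = [
    ([1.0, 1.0], [10.0, 10.0]),
    ([2.0, 1.0], [6.0, 14.0]),
    ([1.5, 3.0], [12.0, 5.0]),
]

chained = chain(years)                          # last year on first, through the links
direct = ideal_index(*years[0], *years[-1])     # last year on first, computed directly

# Circular test: add a final year identical with the first and close the chain.
closed = chain(years + [years[0]])

chain_matches_direct = isclose(chained, direct)
meets_circular_test = isclose(closed, 1.0)
```

With these assumed figures the chained and direct results differ slightly, and the closed chain does not return exactly to unity, which is the failure of the circular test described above.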

Index numbers constructed on the “ideal” formula for the
successive pairs of years do not, when welded together, meet the
circular test. In Professor Fisher’s opinion, there is no reason why
a series of index numbers should meet the circular test. But
the fact remains that index numbers have to be computed for
series of years, or for numerous different places. The most
direct way in which to use the “ideal” formula is to make the
computation for each successive pair of years on this basis and
then to weld together the link indexes thus obtained to secure a
continuous index for the full period. Empirical tests show that,
over a considerable period, index numbers obtained by this means
agree less closely with the “ideal” formula applied directly to
the two terminal years than do indexes obtained from other
formulas for the full period.² Other formulae, therefore,
are to be considered in the derivation of index numbers for more
than two dates or two places.

Another difficulty in the use of the “ideal” formula is to be
noted. The “ideal” formula involves changes of both factors in
each application of the formula. Thus, in a price index, both
prices and quantities differ in each computation. In any single
pair of years it is easy to determine what part of the change is
wrought by the change in prices, what part by the change in
quantities. But if a series of index numbers for successive pairs
of years is welded together to obtain an index for a series of
years, it is almost impossible to say what proportion of the final
result is to be accounted for by changes in prices and what part
by changes in quantities. With both prices and quantities
available, the final result is a composite which represents neither
price movements nor quantity changes separately.

1 It should be noted that successful meeting of the circular test does not in any way prove
that the index in question has accurately measured the course of events between the terminal
years.

2 Professor Persons’ empirical tests (see Review of Economic Statistics, prel. vol. III,
pp. 103-113) appear to be conclusive on this point.

This is a difficulty which is common to all chain indexes; that 
is, indexes obtained for longer periods by welding together link 
index numbers obtained for each successive interval. Link 
index numbers doubtless have certain distinct advantages, the 
most notable being that, in measuring a group movement for 
any pair of years, numerous adjustments may be made to changes 
in the list of articles to be included, or in the respective qualities 
of these articles, or in the extent to which these articles are 
represented by satisfactory data. New commodities may be
introduced, obsolete ones may be dropped; changes in relative 
importance may be readily represented. In other words, link 
index numbers are a more flexible device. But when welded 
together they yield a final result which at times is distinctly 
ambiguous and uncertain. Though formerly in high favor, 
chain indexes obtained by welding link index numbers are now 
commonly viewed with suspicion. Fixed-base indexes appear to 
be preferable. 

Granted that there is need, therefore, for a general form of 
index number which may be used continuously for considerable 
periods, what is to be concluded as to the best form? In general, 
no one form is to be singled out for exclusive use. There is, 
however, a reasonable presumption in favor of forms embodying 
weights which are constant over considerable periods.¹ This
enables the index to register the force of a single factor. The
figures, consequently, are more easily understood and more
generally serviceable.²


1 Of course, care must be exercised to make sure that the constant weights are truly
representative. Furthermore, the weights must be corrected periodically if they are in 
danger of becoming unreal or unreasonable. Doubtless the feasibility of constant, as well 
as variable, weights differs from case to case. 

2 The case for constant weights is further strengthened by the difficulty commonly
encountered in obtaining the necessary data for variable weights.

If, then, constant weights are to be used, what type of average
or aggregate is most satisfactory? Apparently two general forms 
are to be viewed with favor: (1) an aggregative index with 
weights derived from the base year; (2) a weighted geometric 
index with weights drawn from a period of such length, or factors 
of such stability, as to reduce to negligible proportions any danger 
of weight bias. These two forms may well be considered further. 

The advantages of an aggregative index have already been
noted. The simple aggregative form with constant weights which
is here recommended is (for prices)

$${}_0P_1 = \frac{\sum p_1 q_0}{\sum p_0 q_0}$$

This is, of course, merely one of the terms of the “ideal” formula.¹
Results obtained with this type of index check closely with those
which are secured by applying the “ideal” formula to the
successive years. The index is both accurate and unambiguous, and
may be readily developed continuously for considerable periods.
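
A continuous series on this form may be sketched as below. The Python example is illustrative only: the years, prices, and base-year quantities are hypothetical, and the function names are introduced here for the purpose. It also checks numerically that an arithmetic mean of price relatives weighted by base-year values gives the same figures as the ratio of aggregates.

```python
from math import isclose

# Hypothetical base-year quantities and yearly prices for three commodities
q0 = [10.0, 4.0, 25.0]
prices_by_year = {
    1914: [1.00, 2.50, 0.40],   # base year
    1915: [1.10, 2.40, 0.50],
    1916: [1.40, 3.10, 0.55],
}
base_prices = prices_by_year[1914]

def aggregative_index(prices, base_prices, base_quantities):
    """Ratio of aggregates, both valued at base-year quantities, in per cent."""
    numerator = sum(p * q for p, q in zip(prices, base_quantities))
    denominator = sum(p * q for p, q in zip(base_prices, base_quantities))
    return 100.0 * numerator / denominator

def value_weighted_mean_of_relatives(prices, base_prices, base_quantities):
    """Arithmetic mean of price relatives weighted by base-year values, in per cent."""
    values = [p * q for p, q in zip(base_prices, base_quantities)]
    relatives = [p / b for p, b in zip(prices, base_prices)]
    return 100.0 * sum(r * v for r, v in zip(relatives, values)) / sum(values)

# One index number per year, all with the same (constant) weights
series = {year: aggregative_index(p, base_prices, q0)
          for year, p in prices_by_year.items()}
check = {year: value_weighted_mean_of_relatives(p, base_prices, q0)
         for year, p in prices_by_year.items()}
```

Because the weights are fixed, each year requires only one new aggregate, and any change in the series reflects prices alone.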

An alternative to this aggregative index is the weighted
geometric index with constant weights based upon values over a
period sufficient to assure stability in the weighting factors. The
geometric mean, as previously indicated, has the great merit of
being independent of the base and of involving absolutely no bias
in type. Besides being shiftable as to base without recomputation
(a virtue which is of very considerable importance in the
use of index numbers), the geometric mean has the distinct
merit of minimizing the effects of asymmetry in the items, a
matter of no little importance in dealing with data which
constitute mere selections from a larger universe. But weighting in
index-number construction is a process which has its own dangers.
The question remains, therefore, as to whether a weighted
geometric mean may be regarded with favor.
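
The property of being shiftable as to base may be illustrated with a short Python sketch. The prices and the constant weights are hypothetical, and `weighted_geometric_index` is a name chosen here for illustration, not a form taken from the text.

```python
from math import exp, log, isclose

def weighted_geometric_index(prices, base_prices, weights):
    """Weighted geometric mean of price relatives, returned as a ratio."""
    total = sum(weights)
    mean_log = sum(w * log(p / b)
                   for p, b, w in zip(prices, base_prices, weights)) / total
    return exp(mean_log)

# Hypothetical constant weights and prices of three commodities in three years
weights = [50.0, 30.0, 20.0]
p_1914 = [1.00, 2.50, 0.40]
p_1919 = [1.90, 4.10, 0.85]
p_1924 = [1.50, 3.60, 0.70]

index_1924_on_1914 = weighted_geometric_index(p_1924, p_1914, weights)
index_1919_on_1914 = weighted_geometric_index(p_1919, p_1914, weights)
index_1924_on_1919 = weighted_geometric_index(p_1924, p_1919, weights)

# Shifting the base from 1914 to 1919 requires only a division, not a recomputation.
shifted = index_1924_on_1914 / index_1919_on_1914
base_shift_ok = isclose(shifted, index_1924_on_1919)
```

The equality holds exactly (apart from rounding) so long as the weights are the same in every computation, which is the case of constant weights considered here.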


1 An arithmetic mean of price relatives weighted by values of the base year reduces to
the same form; thus,

$$\frac{\sum \left(\dfrac{p_1}{p_0}\right) p_0 q_0}{\sum p_0 q_0} = \frac{\sum p_1 q_0}{\sum p_0 q_0}$$

Reference was made in an earlier connection to certain errors
which may creep into an index through the form of average 
employed; but bias in an index number may be introduced, not 
only by the type of average employed, but by the weights as well. 
Thus if, in the development of a weighted mean of price relatives, 
values of an earlier year are used as weights, the effect of these 
weights is to give the index a downward bias. The converse 
holds true, of course, for values of the later year. In general 
there is a tendency for weighting on the basis of values of a single 
year to introduce a slight bias in index numbers. 

In connection with the examination of bias in value weights, 
it should be noted, however, that such value weights are likely 
to be fairly representative and stable. This follows from the 
fact that the individual p’s and q’s are likely to move in opposite 
directions, at least when regarded relatively within the general 
system of p’s and q’s in which they appear. Thus, if the relative 
price movement of a particular commodity has been markedly 
upward, its quantity movement has probably been relatively 
restrained on that very account. Relatively high prices are 
likely to mean relatively low quantities, and vice versa. It 
follows that the products of prices and quantities (in other words,
values) are likely to remain relatively stable in large masses.
If this is so, there is not likely to be serious bias in weights of 
commodities based upon large value items even if these relate 
to a single year. 

An illustration of this is to be found in weights based upon 
values added in manufacture in the different major industrial 
groups. Figures are given for the census years 1899, 1904, 1909, 
1914, and 1919, in Table 121. Despite the fact that these five 
census years represent strikingly different conditions of manu- 
facturing enterprise in the United States, the relative importance 
of the different groups as evidenced by their “values added”
remains remarkably constant. 

In general it may be safely concluded that serious bias in an
index number is not likely to arise from carefully selected weights.
Systems of weighting which yield approximately the same
weighting factors will result in essentially the same index number.
Refinements of weighting are rarely worth while. The more 
important issue lies between fixed and variable weights. If 
care is exercised to see that the weights are thoroughly repre- 
sentative and stable, the geometric mean with constant weights 
(periodically revised) may be used with eminently satisfactory 
results when the index is to cover a considerable series of dates. 


TABLE 121. PERCENTAGE DISTRIBUTION OF “VALUE ADDED BY MANUFACTURE”
IN MAJOR GROUPS OF INDUSTRIES


GROUP                                  1899    1904    1909    1914    1919
All industries                        100.0   100.0   100.0   100.0   100.0
Food and products                       8.6     8.6     8.8    10.0     9.6
Textiles and their products               …    14.3    15.4    14.4       …
Iron and steel and products            16.9    16.0    16.0    14.8    18.2
Lumber and manufactures                10.9       …    10.2     8.5     6.8
Leather and products                    3.8     3.9     3.8     3.6     3.6
Paper and printing                      8.2     8.8     8.5     8.7     6.8
Liquors and beverages                   6.0       …       …       …       …
Chemicals and products                  6.4     7.0     7.0       …       …
Stone, clay, and glass products         3.8     4.3     4.1     3.8       …
Non-ferrous metals and products           …     4.2     4.1     4.0     3.4
Tobacco manufactures                      …       …     2.8     2.9       …
Vehicles for land transportation        2.6     2.8     3.0     4.5     6.2
Railroad repair shops                   2.4     2.6     2.6     3.0       …
Miscellaneous                             …     7.9     8.0     9.3    13.2

Peculiarities of the data sometimes create conditions favorable
to the use of one type of index, unfavorable to the use of others. 
Aggregative indexes depend upon an adequate record of prices 
and quantities. But data on these factors may not be available 
in sufficient quantity. Value figures may, on the other hand, be 
unusually complete. If relatives can be had, these value data 
may be employed as weights for an index in the form of an 
average of relatives. 

Consider, for example, the construction of an index of the 
volume of manufacture. Paucity of data here creates a serious 
problem. A geometric mean of relatives has the distinct merit 
of permitting a consolidation of data from different series of
widely divergent character. Thus, it is possible to combine
relatives derived from production data, from data on the con- 
sumption of raw materials, on employment and unemployment, 
on machine activity. From all of these indicia of production, 
series of relatives are obtainable which bear directly upon the 
movement under investigation.¹ These relatives may then be
weighted by “values added in manufacture” as reported in the 
Federal censuses. An average of relatives in this case, as in 
certain others, has marked advantages in index-number construc- 
tion. The weighted geometric mean of relatives may well be 
recognized as one of the most serviceable forms of index number. 

In the discussion thus far no mention has been made of ease 
of computation. Nevertheless, this factor is important in the 
making of index numbers. One of the chief objections to variable
weights, for example, is the extent to which they add to the labor of
computation.² It is difficult to make accurate statements regarding
the relative time required for the calculation of the different
forms that have been mentioned; so much depends upon the 
labor-saving devices employed, the competence of the computers, 
and their familiarity with the processes involved. It may safely 
be stated, however, that the computation of no one of the forms 
recommended involves an amount of labor which is in any way 
prohibitive. 

The making of index numbers is a highly important phase of 
statistical analysis in economic and business science. It is a 
phase of analysis requiring great care and discrimination in the 
assembling of data and determination of weights, and high tech- 
nical proficiency in the selection and adaptation of the various 
index-number forms. Fortunately, it is a phase of analysis 
which has had the attention of some of the most astute statistical 
technicians, with the result that standards of correct workman- 
ship are in no part of statistics more clearly defined nor more 
readily applicable. 


1 It would be quite impossible to combine these several lines of evidence in any aggregative
index number without the use of pure abstractions which would appear altogether unreal
if not quite unnatural.

2 Variable weights commonly add even more to the task of assembling the necessary data.


CONCLUSION 


CHAPTER XXIV 


THE NATURE OF STATISTICAL RESULTS 


The distinguishing feature of statistical data is that they
reflect a complexity of causes. Statistical methods, according to
Yule, are methods “especially adapted to the elucidation of
quantitative data affected by a multiplicity of causes.”¹ It is
the existence of multiple causation that creates the need for 
statistical analysis; and it is this same factor that makes difficult 
the interpretation of statistical results. 

The situations with which statistical method has to deal are 
to be contrasted with those prevailing in laboratory experiment. 
Experiment undertakes to create conditions under which only 
one causal element is allowed to vary at a time. In other words, 
laboratory experiments observe the law of a single variable. 
With one causal element varying, the resultant changes are 
carefully noted and the relationship between cause and effect 
thus determined. The results so obtained may be tested by 
further manipulation of factors. By this means, the laws of 
the physical universe are brought within the range of scientific 
knowledge. 

In a good many fields, however, it is impossible to control 
conditions so as to limit variation to one factor at a time. In 
physical science, the disturbing elements are ordinarily so slight 
compared with the factors upon which experiment is being 
focused that these uncontrolled factors may be largely ignored.²
As we move away from the field of exact science, the importance
of dealing with multiple causation increases.

1 Introduction to the Theory of Statistics, p. 5.

2 When these factors have to be considered, they are dealt with ordinarily under the head
of random errors.

In biology, for
example, though many factors are subject to a measure of experi- 
mental control, many others defy such treatment. As a result, 
multiple causation is an important condition in biological investi- 
gation. Raymond Pearl states the case admirably as follows: 
“These (statistical) methods are indispensable in this particular 
case (of research in attacking the problems of genetics), because 
a multitude of separate and distinct causal factors discretely 
distributed in respect of their action are concerned in the deter- 
mination of the make-up of the adult organism. Since the locus 
of action of all of these factors is in each case the individual, it is 
impossible, generally speaking, to study the action of any one 
factor free of the influence of all others. This directly implies a 
necessity for the application of statistical methods.”¹

In social sciences, the impossibility — or, at best, very limited 
possibility — of experiment has been repeatedly noted. The 
so-called law of the single variable cannot be made to apply. 
The observer of economic and business phenomena must take 
the records as they stand, registering as they do in each case 
the operation of numerous factors. He must then deal with the 
records on the assumption of multiple causation, and through 
statistical analysis gain as much insight as possible into the 
nature of the causes that are at work. 

The fundamental distinction between experimental and statis- 
tical methods is that experimental methods are manipulative 
and statistical methods descriptive. In the laboratory, condi- 
tions are set up artificially; the elements are made to behave in 
certain ways at the will of the observer. In statistical investiga- 
tion, on the other hand, conditions are largely beyond the control 
of the observer. The statistical worker must take the records 
as they exist, and, by subjecting them to analysis, make them 
convey as accurately as possible a real account of the factors 
which have given rise to the situation depicted in the data.

1 Modes of Research in Genetics, pp. 17-18.

Here, as before, Pearl has stated the matter excellently. Speaking of
biometrical methods, which are essentially statistical in character,
he states that these methods “are, in last analysis, strictly and
purely descriptive in character. There are but two general ways 
of acquiring and formulating a knowledge of natural phenomena. 
These are the descriptive method on the one hand and the 
experimental method on the other hand. Biometrical methods 
belong in the first of these categories. The only thing which 
they are able to do is to furnish a description, in quantitative 
terms, of existing phenomena.”¹

Since statistical method is descriptive, not manipulative, it is 
of the utmost importance that statistical inferences be drawn 
with careful regard to all the limitations of the evidence. A 
number of considerations must be kept in mind in interpreting 
statistical results.² Some of the more important of these
considerations are:

Relevancy
Logical consistency
Homogeneity
Adequacy
Stability
Accuracy
Probability

The significance of each of these will appear after further
discussion.

In the first place, in weighing the results of statistical analysis,
the question of relevancy must be faced. Are the facts as
disclosed unmistakably related to the factors under investigation?
Are the results of analysis surely germane to the subject of
inquiry? Questions of this sort can hardly be answered, of
course, without a careful definition of the objectives of
investigation; but as noted in earlier connections, clear statement of
purpose is a prerequisite of all sound statistical analysis. Relevancy
is nothing more than the existence of some significant connection
between the purpose of inquiry and the results of analysis.

1 Ibid., p. 62. Though statistical methods are merely descriptive, they may be made to
contribute substantially to better knowledge of the universe. The characteristics of mass
phenomena may be ascertained. Complex phenomena may be resolved into their several
parts. By breaking the complex whole into its simpler elements, new notions of the
interrelation of factors may be obtained. Knowledge of the course of events may be acquired
in more exact quantitative form. The fact that statistical method is purely descriptive
does not keep it from being analytical, nor prevent it from making large contributions to the
sum-total of scientific knowledge.

2 See A. L. Bowley, “The Improvement of Official Statistics,” Statistical Journal,
September, 1908; also H. Secrist, “Statistical Standards in Business Research,” Journal
of the American Statistical Association, March, 1920.

Logical consistency is a requisite of statistical validity which 
has been noted in an earlier connection (see Chapters II and 
III). To maintain logical consistency, definitions must be uni- 
formly applied throughout each investigation. The elements 
which are involved in logical consistency are ordinarily explicit;
that is, they are covered by the definitions which are applied in 
dealing with the statistical units, the observational limits, and 
the lines of differentiation which are explicitly recognized in the 
investigation. Only when these matters are uniformly dealt 
with in all parts of the study can the data be regarded as strictly 
comparable; and only when the data are strictly comparable 
can the results of analysis be made dependable. 

The condition of homogeneity is somewhat different.¹ Factors
affecting homogeneity, as distinct from logical consistency, may 
lie entirely beyond the scope of definition. It follows that even 
if logical consistency is perfectly attained, there may be a com- 
plete lack of homogeneity. Homogeneity is the more inclusive 
criterion. It rests upon the essential likeness of all the elements 
under investigation save those which are supposed to vary and 
the variations of which are subjected to measurement. 

The matter can best be made clear by a few concrete illustra- 
tions. Suppose a comparison is being made of average wages in 
two cotton mills located in different parts of the country. Aver- 
age wages drawn from the payrolls may be computed with perfect 
accuracy and may relate to precisely the same operations in the 
same lines of work in the two mills and yet may be for some 
purposes seriously lacking in homogeneity. Assume, for example, 
that in one of the mills a much larger proportion of women is 
employed in the given operation than in the other mill.

1 On the general subject of homogeneity, see Zizek, Statistical Averages (translated by
W. M. Persons), chap. IV.

Whether or not this difference in the composition of the two labor forces
will deprive the average wage of homogeneity depends upon the 
purpose of the inquiry; but certainly, in some connections, the 
average wage in the one mill cannot properly be placed alongside 
that in the other mill without carefully noting the fact of the 
difference in the proportion of women in the two labor forces. 
Similarly, to cite another illustration, an average of money 
incomes in two different occupational groups may be quite mis- 
leading owing to its failure to recognize income in kind in one of 
the two groups. No amount of care in the definition of money 
income disposes of this complication; in most studies relating 
to the relative economic condition of the two groups of income 
recipients, the data would be seriously lacking in homogeneity. 
The problem of homogeneity is really raised whenever an average 
is taken; and the general rule may well be observed that no 
average is to be derived from cases which are not, for the purposes 
of the particular investigation, essentially of like character. 
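
The cotton-mill illustration may be put in figures. The following Python sketch is hypothetical throughout: the wage rates, the numbers employed, and the names used are assumptions made for illustration and are not drawn from any payroll. It supposes that both mills pay identical rates to each class of worker and differ only in the proportions employed.

```python
def average_wage(rates, counts):
    """Average wage per worker for a labor force of the given composition."""
    total_pay = sum(rates[group] * counts[group] for group in counts)
    return total_pay / sum(counts.values())

# Hypothetical weekly rates, identical in both mills (assumed for illustration)
rates = {"men": 20.0, "women": 15.0}

# Hypothetical numbers employed in the same operation in each mill
mill_a = {"men": 80, "women": 20}
mill_b = {"men": 40, "women": 60}

average_a = average_wage(rates, mill_a)   # 19.0
average_b = average_wage(rates, mill_b)   # 17.0
```

The two averages differ although no worker in either mill is paid at a different rate from a like worker in the other; the whole difference arises from the composition of the labor forces, which is the want of homogeneity in question.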

Another illustration may be profitably noted. Consider the 
seasonal variation of the price of hides (native steer No. 1) in 
the United States from 1890 to 1913. The relatives are given in
Table 122.


TABLE 122. SEASONAL VARIATION OF PRICE OF HIDES, 1890-1913


MONTH          RELATIVE
January          101.5
February          98.8
March             95.0
April             92.4
May               94.9
June              96.6
July              98.3
August           101.9
September        103.8
October          106.0
November         105.8
December         105.0

Average          100.0


Source: Weekly Letter of Harvard Committee on Economic Research, January 7, 1922,
p. 6.

There is unmistakable seasonality here. The question is: just
what does the seasonal variation reflect? As a matter of fact, 
it is in no small measure due to lack of homogeneity in hides. 
Hides in the spring months are inferior to hides in the fall and 
early winter months. This is due to the fact that through the 
winter the hides are likely to be in poorer condition because of 
close herding of the cattle; and are likely to carry more mud and 
dirt and long hair. Since sale of the hides is by weight, fall hides, 
containing less waste, are in better condition, and command a 
better price. Seasonality of hide prices therefore reflects at 
least in part a lack of homogeneity in hides. 
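
The extent of the swing may be seen by grouping the relatives of Table 122. The Python sketch below uses the monthly relatives as read from that table; the grouping of March to May as “spring” and October to December as “fall” is an assumption made here for illustration.

```python
# Monthly relatives of the price of hides, 1890-1913, as read from Table 122
relatives = {
    "January": 101.5, "February": 98.8, "March": 95.0, "April": 92.4,
    "May": 94.9, "June": 96.6, "July": 98.3, "August": 101.9,
    "September": 103.8, "October": 106.0, "November": 105.8, "December": 105.0,
}

mean_relative = sum(relatives.values()) / len(relatives)

# Assumed grouping of months into seasons (illustrative only)
spring = [relatives[m] for m in ("March", "April", "May")]
fall = [relatives[m] for m in ("October", "November", "December")]
spring_mean = sum(spring) / len(spring)
fall_mean = sum(fall) / len(fall)

lowest_month = min(relatives, key=relatives.get)    # April
highest_month = max(relatives, key=relatives.get)   # October
```

The twelve relatives average 100.0, as the table states, while the fall group stands well above the spring group; on the argument of the text, part of that spread measures a difference in the quality of the hides rather than a movement in the price of a uniform article.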

Another illustration of lack of homogeneity may be taken from 
the field of railroad operation. In this field common reference 
is made to certain financial and operating ratios. Some of the 
considerations involved in the interpretation of differences in 
these ratios are set forth in a discussion by Mr. Ray Morris:¹


“Statistics of earnings and expenses are the ultimate check on all the other 
records, and, when taken in conjunction with the statement of work done, 
present the final picture of the operations of the road. Without knowledge 
of the work done, however, earnings and expenses are not an adequate means 
of control. Many roads west of the Transcontinental Divide operate for 
fifty per cent of gross earnings or less (the Oregon Short Line usually operates 
for about forty per cent), while in the eastern states the average is well above 
sixty per cent. Thus, the operating ratio is an uncertain test of efficiency;
the high rates in the newly settled parts of the country make relatively easy
a showing which the best operation in the world could not accomplish in a 
region of intense competition of long standing, where the struggle for business 
has reduced the margin of profit to the railroad to a minimum. 

“These comments apply primarily, of course, to the statistical use of the 
operating ratio by the banker, or broker, or student of railroad affairs who 
is trying to judge one property in terms of another. The manager of the 
road, confronted habitually by the same set of conditions, can form a great 
many accurate opinions from the reported earnings, and they are of the high- 
est statistical importance to him. Where detail knowledge of the property 
is absent, however, there could scarcely be a more perilous standard of 
railroad efficiency than the relation which operating expenses bear to net 
earnings. A road in mountainous country, like the Denver & Rio Grande, 
must pay a relatively high sum for every ton moved because of the necessity 
of double-heading, or of breaking up trains into short sections; a road 
operating in swampy country, like the Yazoo & Mississippi Valley, is likely 


1 See Railroad Administration, pp. 226, 228, 229, 230, 244-247.

to have an abnormal track maintenance cost; a road hauling large proportions
of merchandise will have a high ton-mile rate, but also a high ton-mile
cost, because of the necessity of partial loadings and rapid service. A rail- 
road operating in the cotton belt or the wheat belt will fluctuate greatly from 
one season to another; a railroad in Canada, or close to the Northern border, 
will report a marked increase in operating costs during the winter months. 

“Similar difficulties confront the outsider in making comparisons of 
efficiency based on the ton-mile. When 1000 tons are moved 100 miles, a
service of 100,000 ton-miles has been performed, regardless of the nature of 
the commodity. The Virginian Railway, built to haul coal from the West 
Virginia fields to tidewater, and nearly perfect in its physical equipment, 
frequently produces 100,000 ton-miles by moving a 4000-ton train 25 miles,
with a single engine and train crew. On the other hand, a road loading light 
manufactured articles — baby carriages and bird cages, to use a historic 
example — would do well to load three tons per car, and it would take a 
single train moving approximately 1000 miles, or forty trains moving each 
25 miles, to produce 100,000 ton-miles, under ordinary service conditions. 
The worst of it, from a statistical standpoint, is that most railroads are 
moving a thousand different kinds of traffic, all at the same time, and cannot 
always manage even to haul their coal and their baby carriages in separate 
trains. The ton-mile, in consequence, is an average figure, composed of a 
multitude of dissimilar parts. 

“This does not confuse the general manager, however. He has been 
watching the operations of the Great Valley division for twenty years, and if 
his new superintendent there increases the average train loading from 280 
tons to 310 tons, he regards it properly as a measure of efficiency, because he 
is comparing the results of a known territory operating during a certain 
season under known circumstances, with the same territory and the same 
circumstances in another season. 

“Even this intelligent and discriminating use of the ton-mile, as expressed 
in the terms of average train load, often leads to its own peculiar form of 
error, however. Provided traffic is handled smoothly, at efficient speeds, 
big train loads almost always mean economical working, because they indi- 
cate that the business is being done with the fewest possible locomotives and 
train crews. But if freight is held at terminal points longer than competitors 
are holding it, in order to get a maximum loading, or if the ratings are pushed 
to the limit, with resulting engine failures, blockaded traffic, overtime for the 
crews, and abnormal coal consumption, the big train load may prove a very 
expensive economy. Seven or eight years ago we passed through a period in 
this country when there was a disposition in Wall Street and elsewhere to 
admire the train load for its own self, and there were a good many instances 
where conditions existed not unlike those described.” 
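
The ton-mile arithmetic in Mr. Morris's discussion can be checked with a
short sketch. The function name ton_miles and the 33-car train length are
illustrative assumptions, not taken from the source; 33 cars is chosen only
because, at three tons per car, it gives roughly 100 tons per train.

```python
def ton_miles(tons: float, miles: float, trains: int = 1) -> float:
    """Ton-miles produced by `trains` identical trains."""
    return tons * miles * trains

# Coal: one 4000-ton train moved 25 miles (figures from the quotation).
coal = ton_miles(4000, 25)

# Light manufactures at three tons per car. The 33-car train length is an
# assumption made for illustration; it gives about 100 tons per train.
light_tons = 3 * 33
one_long_haul = ton_miles(light_tons, 1000)        # one train, about 1000 miles
forty_short_hauls = ton_miles(light_tons, 25, 40)  # forty trains, 25 miles each

print(coal, one_long_haul, forty_short_hauls)
```

The same total of roughly 100,000 ton-miles thus conceals very different
amounts of train service, which is the lack of homogeneity complained of.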


One further case may be considered: a case involving analysis
of bad debts in retail trade. The percentage of bad debt losses 
may be figured, of course, on the total sales. The percentage
in this form doubtless has some significance. But calculated on
the basis of credit sales, the percentage has distinctly greater 
significance; for, after all, the question of bad debts is not at all 
concerned in cash sales. With greater homogeneity the results 
assume greater significance. The following brief table indicates 
how the relationship of different lines of business may differ con- 
siderably according to the way in which the percentage of bad 
debts is calculated. 


TABLE 123. BAD DEBT LOSSES IN NEBRASKA RETAIL STORES, 1923


                                        PERCENTAGE OF BAD DEBT LOSSES
NO. OF     VARIETY OF STORE
STORES                                 ON TOTAL SALES   ON CREDIT SALES

   5       City department               [illegible]         1.30
   7       City clothing                     1.00            1.50
  12       General merchandise               1.10            4.45
   7       [illegible]                   [illegible]         2.67
   9       Hardware                           .67            1.45

Source: J. G. Knapp, “Credit Costs in Nebraska Retail Stores,” University Journal
of Business, March, 1924. 
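
A brief sketch may show why the two percentages diverge. Both are computed
from the same losses, so dividing the percentage on total sales by the
percentage on credit sales gives the share of sales made on credit. This
derived share is not reported in the source; it is worked out here only as
an illustration, using two rows of Table 123 whose figures are clearly
legible. The function name is hypothetical.

```python
# Percentage of bad debt losses (on total sales, on credit sales),
# taken from Table 123.
rows = {
    "City clothing": (1.00, 1.50),
    "General merchandise": (1.10, 4.45),
}

def implied_credit_share(pct_on_total: float, pct_on_credit: float) -> float:
    """Credit sales as a fraction of total sales.

    losses/total = pct_on_total and losses/credit = pct_on_credit,
    so credit/total = pct_on_total / pct_on_credit.
    """
    return pct_on_total / pct_on_credit

shares = {name: implied_credit_share(*p) for name, p in rows.items()}
for name, share in shares.items():
    print(f"{name}: about {share:.0%} of sales on credit")
```

On total sales the two lines look nearly alike (1.00 against 1.10); on
credit sales they differ by a factor of about three, apparently because the
general merchandise stores sold a much smaller share of their goods on
credit.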


There are few lines of statistical analysis in which questions of 
homogeneity are more important than they are in the develop- 
ment of appropriate rates and ratios. It is by means of rates 
and ratios that standards are commonly devised on which to 
judge and regulate performance. But every rate and ratio 
raises fundamental questions regarding the significance of the 
relationship which the rate or ratio expresses. A given rate or 
ratio may appear appropriate enough until contrasted with some 
other of better design. In work with rates and ratios, as in all 
other lines of statistical investigation, every effort should be 
made to make sure that the results of analysis bear unmistakably 
upon the objective of investigation, and that the forms adopted 
are the most significant the data will permit. 

Homogeneity is especially important in the treatment of time 
series. In general, it is best to deal separately with periods of
time in which essentially different conditions prevail. When
particular periods of time are obviously subject to exceptional 
influences, they may be deleted in the analysis of the causes which 
normally operate. 

One of the problems involved in homogeneity is the treatment 
of individual cases which do not exhibit the element under obser- 
vation. In general, if the cause under investigation was operative 
in these particular cases, the cases must be included in the 
universe; if the cause was not operative, they must be excluded.
Elimination of the individual items on this basis, however, is not 
always possible. Moreover, it is to be noted that an average 
computed without the elimination of null items may be of interest
as showing the importance of the factors involved for some wider 
group of cases than those directly concerned. Thus, it may be 
desirable to work out a figure for per capita consumption of 
intoxicants although it is recognized that many individuals 
are total abstainers. 
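
The two averages may be contrasted in a small sketch. The consumption
figures below are invented purely for illustration and come from no actual
inquiry.

```python
from statistics import mean

# Hypothetical annual consumption (gallons) for ten persons;
# the zeros are total abstainers (the null items).
consumption = [0, 0, 0, 0, 2.0, 3.0, 0, 5.0, 0, 10.0]

per_capita = mean(consumption)                         # null items included
per_consumer = mean(x for x in consumption if x > 0)   # null items excluded

print(per_capita, per_consumer)
```

The first figure describes the whole community; the second describes only
those in whom the cause under investigation was operative. Neither is
wrong, but they answer different questions.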

Ideally, in the analysis of multiple causes, the different masses 
analyzed should agree in all respects save that involved in the 
stated causal relation. Practically, masses never agree in all 
respects, and the most that can be hoped for is that they agree in 
those respects which affect substantially the causal relation in 
question.

Adequacy is another consideration to be kept in mind in inter- 
preting statistical data. Data may conform to all the require- 
ments of logical consistency and homogeneity and still be incon- 
clusive because lacking in representativeness. When the data 
are complete they may lack representativeness owing to limitations
in the universe represented; when the data are only partial
they may fail to support inferences because of the inadequacy 
of the sample. Many statistical conclusions rest upon “founda- 
tions of sand” simply because the underlying data are not 
thoroughly representative.

The consideration of adequacy is especially important in con- 
nection with sampling. According to Bowley, the essentials of 
the sampling method are: (1) “every unit in the district or class
dealt with must have approximately the same chance of inclusion”;
(2) “selection must deliberately be made at random.” 1
Care must be exercised not to select unduly from among the more 
obvious and accessible cases. Precautions must also be taken 
to avoid disproportionate representation of particular types. 
Thus, questionnaire returns may give an undue representation 
of the more intelligent and conscientious informants. Adequacy 
of sampling increases rapidly with the number of cases included. 
In general, the simplest practicable rule to observe is to increase 
the size of the sample until successive samples give approximately 
the same result. 
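
The rule of enlarging the sample until successive samples agree might be
sketched as follows. The population, the starting size, the step, and the
tolerance are all assumptions chosen for illustration; the function name
stable_sample_size is hypothetical.

```python
import random
from statistics import mean

def stable_sample_size(population, start=50, step=50, tolerance=0.5, seed=0):
    """Enlarge a random sample until two successive sample means agree.

    Returns (size, mean) for the first size whose mean differs from the
    mean at the previous size by less than `tolerance`.
    """
    rng = random.Random(seed)
    previous = None
    size = start
    while size <= len(population):
        # rng.sample gives every unit the same chance of inclusion.
        current = mean(rng.sample(population, size))
        if previous is not None and abs(current - previous) < tolerance:
            return size, current
        previous = current
        size += step
    return len(population), mean(population)  # fell back to a complete count

# Hypothetical universe of 10,000 measurements.
pop_rng = random.Random(1)
population = [pop_rng.gauss(100, 15) for _ in range(10_000)]

size, estimate = stable_sample_size(population)
print(size, round(estimate, 2))
```

Two small samples may agree by accident, so in practice the comparison
would be repeated before the sample is accepted as adequate.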

Closely connected with the question of adequacy is the ques- 
tion of stability. Are there reasons for believing that the results 
exhibited in the data will be repeated in later investigations? 
Fortunately, statistical results which depend upon large masses of 
individuals show an extraordinary degree of regularity. Large 
numbers are said to exhibit inertia. There is surprising uniformity
in certain statistical coefficients under
widely different conditions; such coefficients, for example, as the 
suicide rate, or the mean expectancy of life. Were it not for this 
stability in statistical results, statistical investigation would 
hardly be worth while. It is because underlying laws or tend- 
encies are discernible that statistical results merit the careful 
attention they are given. 

Questions of accuracy are raised by practically all statistical 
material. Accuracy of the raw data may be of count or of 
measurement. As already noted, many statistical items are 
aggregates of individual cases. These are secured by a counting 
process. The attainment of perfect accuracy is here commonly 
possible. The number of students attending a given class, the 
number of shares of stock voted at a given shareholders’ meeting, 
the number of passengers carried between certain stops on a 
railroad train, may be precisely determined. In some very 
large aggregates perfect accuracy may not be worth the time and 
money it would cost; approximate accuracy may suffice. But 


1 See Statistical Journal, September, 1908, p. 466.

perfect accuracy in the sense of an absolutely complete count is
frequently possible with counted variables. 

Data drawn from measurements present a different condition. 
No measurements are ever perfectly accurate. Precision beyond 
certain points is unattainable, and the utmost possible precision 
is rarely worth attempting. The desirable degree of accuracy 
depends upon the ease with which accurate measurements can 
be had, and upon the value of accuracy in the particular analysis 
in hand. 

Errors are of different sorts. They arise, in some instances, 
from some persistent factor, the effect of which may be deter- 
mined. An instrument out of adjustment may introduce a con- 
stant error in a certain direction. When the force of a constant 
error is known, allowance for the error can be made in the analysis 
of the data. Errors of this sort therefore do not seriously inter- 
fere with statistical analysis, save when their presence is not 
recognized or their extent cannot be ascertained. 

Errors are commonly to be divided into compensating and 
cumulative errors. Compensating errors are of such character 
that they tend to offset one another and leave the general result 
undisturbed.1 Reference has been previously made to the “law
of error” which expresses the normal distribution of errors that
are accidental in character. Such errors are so distributed that 
the negative errors offset the positive, and the true value of the 
variable is not materially affected. Cumulative errors, on the 
other hand, tend to pile up with the result that the value of 
the variable as recorded in the observations may be seriously off 
the true value. Cumulative errors invariably call for adjust- 
ment and correction. 
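
The contrast between compensating and cumulative errors may be sketched
with simulated readings. The data are hypothetical: rounding to the nearest
unit stands for errors that fall on either side of the true value, and
dropping fractions stands for errors that all have the same sign.

```python
import random

rng = random.Random(42)
true_values = [rng.uniform(10, 100) for _ in range(10_000)]

# Compensating errors: each reading rounded to the nearest whole unit,
# so the errors fall on either side of the true value.
rounded = [round(v) for v in true_values]

# Cumulative errors: each reading has its fraction dropped,
# so every error lies in the same direction.
truncated = [int(v) for v in true_values]

true_total = sum(true_values)
compensating_error = sum(rounded) - true_total
cumulative_error = sum(truncated) - true_total

print(round(compensating_error, 1), round(cumulative_error, 1))
```

In a run of this kind the error in the rounded total stays small, while the
error in the truncated total grows to about half a unit for every item
recorded.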

Reference is commonly made in statistical work to the probable 
error of a statistical result or coefficient. The concept of the 
probable error relates to the fluctuations which would occur in 
the statistical result or coefficient if repeated universes of the same 
sort were analyzed in the same way. The probable error is that 


1 Because of the compensating character of errors in the data, a final statistical result 
may be substantially more accurate than are the data from which it is derived. 


degree of deviation in the result above and below which any given 
result is equally likely to fall. The probable error thus is defi- 
nitely connected with the general notion of sampling, and is hardly 
applicable unless the nature of the case justifies the assumption 
of the conditions which underlie the theory of sampling. 
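
Where the conditions of random sampling may be assumed and the fluctuations
follow the normal law, the probable error is 0.6745 times the standard
deviation of the result. The simulation below is a minimal sketch of that
relation; the universe (mean 50, standard deviation 10), the sample size,
and the number of repetitions are assumptions made for illustration.

```python
import random
from statistics import mean, median, stdev

rng = random.Random(7)

# Means of 4000 repeated random samples (n = 100) drawn from a
# hypothetical normal universe with mean 50 and standard deviation 10.
sample_means = [mean(rng.gauss(50, 10) for _ in range(100)) for _ in range(4000)]

centre = mean(sample_means)

# Probable error found empirically: the deviation which half of the
# results exceed and half fall short of.
probable_error = median(abs(m - centre) for m in sample_means)

# Probable error from normal theory: 0.6745 times the standard deviation.
theoretical = 0.6745 * stdev(sample_means)

inside = sum(abs(m - centre) <= probable_error for m in sample_means) / len(sample_means)
print(round(probable_error, 3), round(theoretical, 3), inside)
```

Half of the simulated results fall within the probable error of the
central value, which is the sense of the definition given above.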

The concept of probability is fundamental in many lines of 
statistical work. In the interpretation of almost any statistical 
result, the question is: how far do the differences which have 
been observed and analyzed result from mere chance? Com- 
monly, there is no answer to this question, owing to the fact that 
the conditions which are supposed to obtain in the application 
of the more fundamental propositions in probability do not 
exist. Ordinarily, notions of probability are applicable to 
statistical data only when large numbers of individual cases 
have been brought together, each case having been influenced by 
a multitude of forces independent of one another and operating 
without bias on the individual items of the mass. Obviously, 
this is a condition which is not even approximated in the great 
bulk of statistical data in economic and business science. 

Conclusions from statistical analysis in the fields of economics 
and business administration do not rest at all, as a rule, on the 
traditional notions of probability. Instead, exterior tests of 
reasonableness — tests commonly of a non-statistical character 
—have to be applied. The consistency of results obtained in 
different inquiries is important. Analysis not infrequently fol- 
lows the lines of specific analogy. The universe is assumed to 
be orderly in character and to yield similar results under similar 
conditions. In this connection a quotation may well be made 
from Warren M. Persons’ Presidential Address at the Eighty-fifth 
Annual Meeting of the American Statistical Association: 1

“The recognition of this belief in a general orderliness of affairs is necessary 
if we are to understand the manner in which the statistician moulds his 
investigation and arrives at a statistical inference. In the first place, he
follows, as well as his material allows, the methods of the experimental
scientist when he selects, as a basis for forecasting, a past period for study 


1 “Some Fundamental Concepts of Statistics,” Journal of the American Statistical
Association, March, 1924, pp. 4-5.

as nearly as possible like that of the present. He attempts to find a specific
analogy existing in an orderly universe. But he realizes that analogies 
differ greatly in their persuasive quality. The importance for an inference 
of a given statistical result pertaining to a given period is greatly increased, 
first, if similar or consistent statistical results obtain for sub-periods; second,
if similar or consistent statistical results obtain for other periods and under 
different circumstances; and third, if all of the statistical results agree with, 
are supported by, or can be set in the framework of, related knowledge of a 
statistical or non-statistical nature. To illustrate the first point, if we have 
found, for instance, that a periodic function with a period of 40 months fits 
a time series of money rates for a span of 50 years the conclusion that there 
is a real period of 40 months for the entire span is strengthened if we obtain 
the same periodic function for each of two or more segments of the given 
50 years. Also, to illustrate the second point, the conclusion is further 
strengthened if the same function is found for other than the given 50-year 
span and its segments. Likewise, the conclusion of a 40-months period 
would be further supported by the securing of evidence, statistical or other- 
wise, of corresponding fluctuations in business affairs. In other words, 
stability of statistical results and agreement with non-statistical results are 
potent arguments for continued stability in an orderly universe.” 


The theory of probability is hardly applicable to most economic 
data. Professor Persons states the point as follows: 1


“There is a special objection to the application of the theory of probability 
to the particular economic data which constitute our material. If the theory 
of probability is to apply to our data, not merely the series but the individual 
items of the series must be a random selection. In fact, a group of successive
items with a characteristic conformation constitutes our material. Since the
individual items are not independent, the probable errors of the constants 
of a time series, computed according to the usual formulas, do not have their 
usual mathematical meaning. Thus, the ‘probable error’ of 0.03 in a 
coefficient of correlation of + 0.75 between the monthly items of pig-iron 
production and money rates six months later does not indicate, as one would 
conclude from the theory of probability, that the chances are billions to one 
against the independence of the two variables; or, to state the idea more 
specifically, that if we compute a coefficient from data of ‘any’ other actual 
period the chances are more than ten millions to one that its value would be 
over + 0.50. In fact, the significance of the ‘probable error’ of a constant 
computed from time series is not known. 

“The thesis that statistical probabilities give us no aid in arriving at a 
statistical inference has been developed with great skill and, I think, success 
by John Maynard Keynes in his ‘Treatise on Probability,’ which I have 
already quoted. Summarizing his position, he says, ‘In order to get a good 


1 “Some Fundamental Concepts of Statistics,” Journal of the American Statistical
Association, March, 1924, pp. 7-8.

scientific argument we still have to pursue precisely the same methods of
experiment, analysis, comparison, and differentiation as are recognized to be 
necessary to establish any scientific generalization. These methods are 
not reducible to a precise mathematical form. ... But that is no reason 
for ignoring them, or for pretending that the calculation of a probability 
which takes into account nothing whatever except the numbers of instances, 
is a rational proceeding. ... Generally speaking, therefore, I think that
the business of statistical technique ought to be regarded as strictly limited 
to preparing the numerical aspects of our material in an intelligible form, so 
as to be ready for the application of the usual inductive methods. Statistical
technique tells us how to “count the cases” when we are presented
with complex material. It must not proceed also, except in the exceptional 
case where our evidence furnishes us from the outset with data of a particular 
kind, to turn its results into probabilities; not, at any rate, if we mean by 
probability a measure of rational belief.’ ” 
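
The odds which Professor Persons cites can be reproduced from normal
theory, which is the very theory he holds to be inapplicable to time
series. The sketch below takes his figures (a coefficient of +0.75 with a
probable error of 0.03) and assumes a normal sampling distribution; the
function name upper_tail is hypothetical. It shows only what the usual
formulas would imply, not what the data warrant.

```python
import math

def upper_tail(z: float) -> float:
    """P(Z > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))

r, probable_error = 0.75, 0.03
sigma = probable_error / 0.6745  # standard error implied by the probable error

# Chance, on normal theory, of a coefficient as low as +0.50.
p_below_half = upper_tail((r - 0.50) / sigma)
# Chance, on normal theory, of a coefficient as low as zero (independence).
p_below_zero = upper_tail((r - 0.0) / sigma)

odds_over_half = (1 - p_below_half) / p_below_half
odds_against_independence = (1 - p_below_zero) / p_below_zero

print(f"{odds_over_half:.3g} to one, {odds_against_independence:.3g} to one")
```

The first figure exceeds ten millions to one and the second exceeds
billions to one, in agreement with the quotation. The objection is that
successive monthly items are not independent, so that these odds have no
such meaning.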


The discussion of conditions affecting the conclusions of statistical
investigation must have made it evident that such conclusions
are always conditional in character. Statistical results
may create presumptions in favor of certain conclusions; they 
rarely dispose of matters completely. They serve to simplify 
the situation; possibly to remove from the range of reasonable 
expectation certain factors previously regarded as important. 
But statistical analysis rarely, if ever, entirely eliminates un- 
certainty as to the factors involved in any situation. Statistical 
results merely afford a more accurate and precise description of 
the elements of an object or situation. On the basis of available 
data, statements may be made as to certain probabilities; but 
these are not probabilities in the sense of the general theory of 
probability. They are inferences which have apparently a better 
chance of validity than other inferences which may be entertained 
on the same subject. Statistical results, in other words, almost 
invariably leave room for some divergence of opinion and for 
the entertainment of various hypotheses as to the exact relation- 
ship of the factors under investigation. Statistical results are 
rarely, if ever, positively conclusive. 

Great care consequently should be exercised in the interpreta- 
tion of statistical material, that any generalizations may be kept 
strictly within the limits of the evidence. Sometimes conclusions 
are drawn which are manifestly inconsistent with the data. More
frequently, however, conclusions are arrived at which are not
necessarily the only conclusions, nor the most probable conclu- 
sions, to be drawn from the data. Exercise of judgment based 
upon wide familiarity with the situation, and great caution in the 
statement of conclusions from the data, are the beginning of 
wisdom in statistical work. 


APPENDICES


APPENDIX A 1


THE COLLECTION OF PRIMARY DATA 


The collection of statistical data may be from either primary
or secondary sources. Secondary sources consist of published or 
documentary materials in which the statistical data have already 
been assembled. Some of the problems involved in the use of 
such secondary sources are considered in Appendix B. In this 
Appendix A attention is directed rather toward some of the con- 
crete problems involved in the assembling of data not previously 
made available.2

The first question to be considered in the collection of primary 
data is the nature of the information it is desirable to obtain. 
Every statistical inquiry must start with a definite objective. 
This does not mean that there are to be preconceptions before 
the inquiry is made; investigation should proceed with an open 
mind. But scientific inquiry gets its direction from working 
hypotheses; otherwise it is bound to proceed blindly. In the 
collection of primary statistical data some definite idea of the 
information to be sought is necessary from the outset.3

As will be seen shortly, the information actually to be sought 
is dependent upon many different considerations. The cost of 
obtaining the data commonly has an important bearing upon the 
scope of the inquiry. Large-scale enumerations are expensive, 
and it is necessary at times to limit the scope of the inquiry merely 
to keep within the limits of available funds. Furthermore, no 
inquiry should be pushed beyond the point at which the trouble 
and expense required to obtain the data are no longer justified 


1 This appendix may be profitably used with Chapter III.

2 The more general and abstract considerations underlying original observation are dealt
with elsewhere (see Chapter III).

3 It is often helpful to draft skeleton tables so that there may be perfectly definite ideas of
the forms the assembled data should take.

by the value of the information secured. Still other limits to the
scope of an investigation are set by the personnel of the organi- 
zation making the inquiry, — by the character, education, intelli- 
gence, and technical expertness of the force. Again, the inca- 
pacity of informants often imposes serious restrictions. Finally, 
the nature of the subject matter itself has a great deal to do with 
the scope of the inquiry. The data to be collected must be ob- 
jective, concrete; they must be relevant, significant. While 
considerations such as these raise problems which can be better 
dealt with after the technique of analysis has been examined, 
they are to be kept in mind at the very start of statistical in- 
vestigations in connection with the examination of the factors 
determining the scope of original inquiry. 

Definite decision regarding the nature of the information to 
be sought in a particular investigation involves a canvass of the 
possible sources from which to draw the information. It may be 
best to secure the information indirectly through some organiza- 
tion already in touch with the situation or having some easy 
approach to those who are best in position to furnish the data. 
Commonly, however, it is preferable to go directly to the original 
sources, that the information may pass through as few hands as 
possible and may be furnished directly by those who are in the 
best position to safeguard its accuracy and validity. 

When the sources to be canvassed have been definitely selected, 
the method of actually securing the data must be determined. 
Here two fundamentally different processes are available. One 
calls for direct personal approach to the informant, and is known 
as the field method. The other involves a communication in 
writing without personal contact, and is referred to as the cor- 
respondence method. The two methods raise very different 
problems of organization and procedure, and subject the inquiry 
to essentially different conditions. Some of these differences 
are to be further examined.1 


1 Sometimes statistical data are collected by the analyst himself. This is usually the 
case, of course, with work in the physical sciences. Sometimes the entering of the original 
records is made a purely mechanical process through the use of self-recording instruments.

Field investigations take two different forms: first, those in
which the contact is made by temporary enumerators employed 
for a short time without any considerable preparation for the 
work; second, those in which the canvass is made by special 
agents working ordinarily for longer periods, and in any case es- 
pecially qualified for the inquiry. Naturally, if temporary enu- 
merators are employed, the scope of the canvass must be kept 
within much more simple and narrow limits.1 When special 
agents are employed, more technical ground may be covered. 
These agents ordinarily work through more or less extended in- 
terviews with the informants, and are supposed to be conversant 
with the various technical problems involved in the inquiry. 
They greatly extend the possible scope of statistical investigation. 

Obviously, when enumerators are employed, the selection and 
specific training of the force has an important bearing upon the 
quality of the facts which may be gleaned from the statistical in- 
quiry. The force should be chosen with as much care as condi- 
tions permit. The recruiting of the thousands of enumerators 
required for the Federal census of population has to be, of neces- 
sity, a rough-and-ready process. In earlier years, appointments 
were made locally and not uncommonly with political considera- 
tions partly in mind. More recently, formal written examina- 
tions have ordinarily been employed. Every effort that is made 
to secure as good a working force as the circumstances permit 
is rewarded in the greater dependability of the returns. 

The correspondence method of collecting data avoids the problem
of recruiting the field force, but introduces other complications
of an even more serious character. The difficulty is to
obtain the intelligent coöperation of the informant with only
a written communication to enlist his interest. When the cor- 
respondence method is used continuously to obtain information 
at regular intervals from informants who are kept in touch with 
the central office, satisfactory results are not uncommon. Here 


1 When the temporary enumerator is employed, blanks may be sent to the informants in 
advance of the visit of the enumerator and simply checked over and collected by the 
enumerator upon his visit; or the blanks may be filled in by the enumerator himself in 
connection with questions put orally to the informant.

the interest of the informant may be cultivated over a considerable
period with the result that various incentives may be brought
into play. When, however, the correspondence method is em- 
ployed in some isolated inquiry, the difficulties are very great. 
Unless the inquiry can be confined to narrow limits and simply 
stated, or the interest and willing coöperation of the informants
can be enlisted in some unusual fashion, the results are very 
likely to be unsatisfactory. The correspondence method is to be 
employed cautiously and only under conditions which appear 
definitely to lend themselves to its use. 

Instructions to those who are to supply the data are necessary 
under both the field and the correspondence methods. Under 
the field method, it is the duty of the enumerator or agent to see 
that the informant gives the information in accordance with the
wishes of the collecting office, hence instructions are placed ordi- 
narily in the hands of the enumerator or special agent. Under 
the correspondence method, the instructions have to be trans- 
mitted to the informant in connection with the communication 
requesting the data. In this case, the instructions must be in 
perfectly explicit and unambiguous form. Furthermore, they 
must be sufficiently simple and brief not to deaden interest nor 
to excite hostility. Instructions to field enumerators, on the 
other hand, may be made considerably more detailed and intri- 
cate; and those to special agents may be of a highly technical 
character. 

Perhaps the most important step of all in organizing the col- 
lection of primary data is the preparation of the necessary forms, 
or schedules. These vary widely with the method of collection, 
the character of the informant, and the nature of the ground to 
be covered by the investigation. The essentials of a satisfactory 
statistical schedule are perhaps most clearly seen in forms pre- 
pared for inquiries through correspondence since here the diffi- 
culties are most serious. The requirements of the schedule in 
this case are to be carefully examined. 

The fundamental problem in the preparation of the schedule 
is the determination of the number, form, and content of the
questions to be put. In general, of course, the number of questions
should be as small as is consistent with the purposes of the
inquiry. The schedule must not in any case be made so long as to irritate, or
discourage, or antagonize the informant. With regard to the 
form and content of the questions, the statement of Bowley has 
been widely quoted:

“. . . The questions must be so clear that a misunderstanding is impossible,
and so framed that the answers will be perfectly definite, a simple number, 
or ‘yes’ or ‘no.’ They must be such as cannot give offence, or appear in- 
quisitorial, or lead to partisan answers, or suppression of part of the facts. 
The mean must be found between asking more than will be readily answered 
and less than is wanted for the purpose in hand. The form must contain 
necessary instructions, making mistakes difficult, but must not be too com- 
plex. The exact degree of accuracy required, whether the answers are to 
be correct to shillings or pence, to months or days, must be decided. Every 
word and every square inch of space must be keenly criticised. A little 
trouble spent upon the form will save much inconvenience afterwards.” 1


Considerations such as are here set forth are to be carefully 
observed in the formulation of all questionnaires or schedules 
the filling in of which must be left in inexpert hands. 

The physical form of the schedule or questionnaire is highly 
important. Commonly, the schedule will start with some simple 
preface, or explanation of purpose. This should give the reasons 
for the inquiry, and seek to enlist the interest and cooperation
of the informant.2 The preface, as well as the schedule as a
whole, should be as brief as possible. The schedule should afford 
definitions for all terms which are not self-explanatory. The 
units to be employed should be carefully indicated and, when- 
ever possible, should be of a familiar sort. Definitions of terms 
and units may well be placed in close proximity to the entries to 
which they relate. Computations should not be expected of the 
informant. Filling in the questionnaire or schedule should be 
made as easy as possible by simplicity of arrangement, and sepa- 
ration of different parts of the form by appropriate rulings. By 
allowing the informant to answer merely by placing a check-mark
against alternative answers or by crossing out part of the
printed matter and leaving the balance, the task of supplying
the desired information may commonly be made much easier.
This same object may be served also by making every provision
for a safe and easy forwarding of the returns. This is ordinarily
best accomplished by providing stamped self-addressed return
envelopes or containers in which the forms may be mailed or
shipped with the least possible effort. In every practicable way,
the task of the informant should be reduced to the smallest
possible proportions.

1 Elements of Statistics, p. 15.

2 The authority of the collecting agency may be cited, but as a rule actual compulsion is
to be regarded as a last resort in the collection of statistical data.

The success of practically all field inquiries depends upon the 
goodwill of the informants. The willingness to cooperate is to
be cultivated by all practicable means. Assurance of confidential 
treatment of the returns is usually necessary to this end. All 
sorts of organizations may be employed to broadcast informa- 
tion regarding the purpose and importance of the investigation. 
Newspaper publicity is often a valuable aid. In general, every- 
thing possible should be done to create a favorable attitude on 
the part of those who will be called upon to give the desired in- 
formation. 

The final step in the collection of primary data is the editing 
of the returns, and the making of any necessary follow-up. The 
returns, by whatever method obtained, require checking upon 
receipt at the office. This checking covers such points as necessary
accuracy, consistency, uniformity, completeness.1 Some of
the returns may appear so defective as to warrant no further 
consideration and may have to be discarded at once. Others, 
though somewhat defective, may invite further treatment. 
Through the field force, or by further correspondence with the 
informants, missing items may be supplied, or apparent incon- 
sistencies or inaccuracies corrected. In general, a careful audit 
of the returns improves the quality of the data sufficiently to 
make the work a highly important phase of the general process 
of original collection. 
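The editing step just described can be sketched in a few lines. The sketch below is illustrative only: the item names (`union`, `members`, `unemployed`) and the rule for discarding a return are assumptions introduced for the example, not part of any schedule discussed in the text.

```python
# Hypothetical items that every return is expected to carry.
REQUIRED = ("union", "members", "unemployed")

def edit_return(ret):
    """Classify one return as 'accept', 'follow up', or 'discard'.

    Assumed rules: a return with every required item missing is discarded;
    a return with some items missing, or with an inconsistency, is sent
    back for follow-up; all others are accepted.
    """
    missing = [item for item in REQUIRED if ret.get(item) in (None, "")]
    if len(missing) == len(REQUIRED):
        return "discard", missing
    problems = list(missing)
    # Consistency check: the unemployed cannot outnumber the members.
    if not missing and ret["unemployed"] > ret["members"]:
        problems.append("unemployed exceeds members")
    return ("follow up" if problems else "accept"), problems

print(edit_return({"union": "A", "members": 100, "unemployed": 5}))
print(edit_return({"union": "B", "members": 50, "unemployed": 60}))
print(edit_return({}))
```

Returns marked for follow-up would go back to the field force or to the informant by correspondence; only those marked for discard are dropped at once.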


1 In connection with the editing, code numbers may be placed against the entries in the
schedules, that the subsequent tabulation of the data may proceed expeditiously.


THE PRELIMINARY EXAMINATION OF SECONDARY DATA


STATISTICAL analysis has to deal frequently with data collected 
and tabulated in connection with some previous investigation. 
Under these conditions analysis has to be more largely adapted 
to the character of the data than is the case when collection and 
tabulation are carried through with the objectives of analysis 
fully in view. Consequently, when published, or secondary, 
material has to be utilized, a preliminary examination, designed 
to ascertain the character of the data, is desirable. In this preliminary
examination, no attempt at interpretation need be made;
interpretation had best await the completion of analysis. But 
certain factors which may affect the course of analysis merit 
attention at the start. These factors are of relatively simple 
character and may be disposed of satisfactorily without full 
knowledge of the technique of analysis. 

In the first place, the source of the data should be carefully 
noted. Both the authority originally responsible for the data 
and the publication or record from which the data have been 
actually drawn should be indicated. The reference should be 
definite and detailed; at least sufficiently so to permit direct 
verification of the material by any intelligent reader. If the data 
are to be found in several different publications, this fact may well 
be stated. In the case of current data, the dates of publication 
should be noted. In general, a clear indication of sources should 
always be made when secondary data are employed. 

Sometimes secondary material is accessible in some one publication
before it is in others. In this case, the publication furnishing
the data most promptly should be mentioned. Dispatch
in the dissemination of accurate statistical information is sometimes
an important contribution to effective analysis.

1 This appendix may be used with Chapter III after Appendix A.

Sometimes the first published data are provisional in char- 
acter and avowedly subject to later revision. When this is the 
case, the fact should be recorded. Furthermore, the period 
ordinarily elapsing between the appearance of the preliminary 
and final figures and the extent of the usual revision should be 
stated, if possible. In the analysis of current data, these consid- 
erations are sometimes highly important. 

In tracing and describing the sources of data, it is desirable 
to ascertain the period for which the data in question are availa- 
ble. This frequently raises serious questions concerning the 
continuity of given series. Unlike units may have been employed 
at different periods. Different titles may have been attached 
to the same series at different times. More probably, the same 
title may have been employed for series essentially unlike at 
different times. This lack of continuity in data is one of the 
most serious complications in statistical analysis. It receives 
attention elsewhere in the present study. In the preliminary 
examination of secondary data, it is usually enough to note any 
lack of continuity, and to describe carefully the nature of any 
changes in the composition of the series. 

In taking data for analysis, some indication should be given
of the way in which the data have been originally collected. 
Have the informants been canvassed by correspondence or in 
person? If by the former method, what forms have been em- 
ployed and what system of follow-up used? If informants have 
been solicited personally, have they been visited by special agents 
or by ordinary enumerators? What schedules have been em- 
ployed in the canvass? Information along these lines is com- 
monly difficult to obtain unless those analyzing the data have 
been in touch with their collection. Happily there is a growing 
tendency among the best statistical offices to publish with statis- 
tical records a full account of the means by which the data have 
been collected. When such information is not volunteered, it
should be sought if an exhaustive analysis of the data is to be
undertaken. An understanding of the methods of collection
frequently affects the interpretation of the results of analysis.

Sometimes the data are drawn from estimates instead of re- 
ports. In this case, it is especially important to ascertain the 
method by which the estimates have been made. If such in- 
formation is not forthcoming, the data are to be used with the 
utmost caution, and only when supported by other material 
drawn from more clearly accredited sources. 

The preliminary examination of data should entertain a num- 
ber of other questions concerning the significance of the data. 
The unit or units in which the variables are expressed should be 
carefully noted. The limits of time and space covered by the 
data must be explicitly stated. The attributes in terms of which 
the objects of inquiry are specified should be most carefully as- 
certained. The measure of detail of time, location, and attri- 
bute with which the items have been reported should be recorded. 
Matters of this sort cannot be too thoroughly covered before 
analysis is begun. 

Still another point is to be examined if the data are in the form 
of derivatives. Such figures must have been obtained from 
others. From what others? And by what process? Unless 
these questions can be satisfactorily answered, gross error may be 
concealed behind a series which, upon its face, seems clearly 
acceptable. 

Finally, it is desirable in the preliminary examination of data 
to get some notion of the accuracy and completeness of the data. 
To what degree of refinement have the original observations been 
carried? Are there any evidences of bias, or willful falsification?
To what extent are all cases (the entire “population”) included
in the returns? If sampling has been resorted to, may the sample
be presumed to be adequate in size, devoid of bias, and, in general, 
truly representative? Questions of error and accuracy cannot 
be fully dealt with until analysis has run its course, but all availa- 
ble information on these subjects may well be brought together 
in the preliminary stages of the study. 

The points which have been discussed by no means exhaust
the list of those that may be considered in connection with a
preliminary examination of data. But many of the relevant 
considerations are satisfactorily entertained only when the na- 
ture and scope of statistical analysis are fully understood. They 
may well be put aside for the present, therefore, and developed 
more fully in connection with the later study of problems 
encountered in the interpretation of the results of statistical 
analysis. 


APPENDIX C 1


THE MECHANICS OF TABULATION 


THE technique of tabulation varies with the size of the group 
to be counted. When the number of cases is relatively small, 
tabulation is most easily accomplished in one of two simple ways: 
(1) the direct sorting and counting of individual returns; (2) the 
entering of individual cases on tally sheets and the subsequent 
counting of these entries. These two rudimentary methods 
may be briefly described. 

When the returns from original observation are made on in- 
dividual cards, each bearing the record of a single case, there 
may be no more expeditious way of tabulating the results than 
by simply sorting the cards and then counting the number in 
each stack. Though this procedure involves more wear and 
tear on the original returns than is sometimes desirable, it may 
occasionally be followed to advantage. Before the days of mechanical
tabulation the method was used even in large-scale
enumerations.2 It may still be satisfactorily used when the
amount of tabulating to be done is not large, and returns have 
been made, perhaps with a view to this method of tabulation, 
on the plan of a separate card for each case. 

Instead of sorting and counting the original returns directly,
tabulators may make use of the tally sheet.3 The tally sheet
is really a blank table bearing captions and stub-items indicating
the totals to be counted. Each individual case is checked into
the proper compartment of this table. When all the cases have
been entered, the number of check-marks in each compartment
is read and recorded.1 In Table 124 an illustration is given of
a tally sheet on which a tabulation has been completed.

1 This appendix may be used with Chapter IV.

2 A slight modification was sometimes introduced through the transfer of data in some
cases to small cardboard chips which were counted in place of the original returns. Various
colors and symbols on these cards facilitated transcription and the count. The method was
referred to as the chip system of tabulation.

3 Direct sorting of the returns on individual items is impossible, of course, when the
returns come in on larger forms each carrying the records of a considerable number of cases.
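The tally-sheet procedure can be imitated in a short program. This is a minimal sketch under stated assumptions: the class intervals are taken to be one percentage point wide, and the list of returns is invented for illustration.

```python
from collections import Counter

def tally(values):
    """Check each value into its class (one percentage point wide)
    and return the count of check-marks per class, in class order."""
    counts = Counter(int(v) for v in values)  # int(2.7) -> class 2, i.e. 2.0 - 2.9
    return dict(sorted(counts.items()))

# Hypothetical percentages unemployed reported by eight unions.
returns = [0.4, 1.2, 1.9, 2.5, 2.7, 2.0, 3.3, 7.1]
sheet = tally(returns)

for lower, n in sheet.items():
    # One stroke per case, followed by the recorded total for the class.
    print(f"{lower}.0 - {lower}.9  {'/' * n:<5} {n}")
print("All classes", sum(sheet.values()))
```

The total over all classes should equal the number of returns entered, which gives a simple check on the count.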

TABLE 124. FREQUENCY DISTRIBUTION OF THE PERCENTAGE OF MEMBERS
UNEMPLOYED JULY 1, 1920, IN 120 REPORTING TRADE UNIONS

Percentage        Number of Unions
Unemployed        Reporting
0.0 - 0.9              5
1.0 - 1.9         [illegible]
2.0 - 2.9             37
3.0 - 3.9             17
4.0 - 4.9             10
5.0 - 5.9         [illegible]
6.0 - 6.9              5
7.0 - 7.9              1
8.0 - 8.9              1
9.0 - 9.9              1
All Classes          120

[The tally marks entered against each class are not legible in this copy.]

The mechanics of tabulation are quite different when the number
of cases to be counted is large (say, several hundred) or
when numerous independent tabulations are to be made of a
smaller number of cases. Under these conditions mechanical
methods are in order. Different types of equipment for this
purpose are available.2 All are electrically operated and all
depend upon the transfer of the statistical data to perforated
cards, each case being given a separate card.3 It is unnecessary
to explain here the operation of the different machines; this is
best done in conjunction with practice work on the machines
themselves. A brief explanation may well be undertaken, however,
of the principles underlying the transfer of data to the
perforated card, for the card is really the limiting factor in
mechanical tabulation.

1 The counting is facilitated if the check-marks have been entered in one of the two
following ways. Method (1) enters the fifth check-mark across the preceding four. The
marks may then be counted in multiples of five. Method (2) enters the tenth check-mark
as an X over the preceding nine. The total may then be easily secured by counting these
groups of ten.

2 The most widely employed are the installations of the Tabulating Machine Company
and of the Powers Tabulating Machine Company.

3 Machines for punching the cards are included in the standard equipment.

The punch card, so-called, when unperforated (that is, when
statistically blank or empty) carries a number of columns of
digits running from 0 to 9. As many as 45 of these columns may
be placed on a card.1 Obviously, the digits in any two columns
may be used to represent the numbers from 0 to 99, inclusive;
in any three columns, to represent the numbers from 0 to 999,
inclusive; and so on. Any statistical fact can be entered on the
card, therefore, if the fact can be represented by a number. In
the transfer of statistical data to the card the first step is to devise
a scheme under which every item to be tabulated may be given
numerical expression.

1 The appearance of both a blank and a perforated card is shown below. The upper
left-hand corner of the cards is usually clipped so as to facilitate detection of a card
improperly placed in a stack.

[Illustration: a blank punch card and a perforated punch card.]

Many statistical data are initially recorded in numerical form.
Thus, if original observation has included the age of numerous
individuals in a given group, the age returns to the nearest year
are readily transferred to the card in any two chosen columns.1
The actual entry is made by punch holes in the card. Thus, if
holes are punched at 2 in the left-hand of two columns and at 5
in the right-hand of the two, an age of 25 years is represented on
the card. Data given originally in numerical form are thus
easily entered in perforations on the punch card.
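The entry of a number in a field of columns may be sketched as follows. The function names are hypothetical, introduced only for illustration; each column is assumed to carry a single punch at one of the digits 0 to 9.

```python
def punch_field(number, columns):
    """Return the digit punched in each column of a field, left to right."""
    if number < 0 or number >= 10 ** columns:
        raise ValueError("number does not fit in the field")
    # Pad with leading zeros so that every column of the field is punched.
    return [int(digit) for digit in str(number).zfill(columns)]

def read_field(punches):
    """Recover the number represented by the punches of a field."""
    value = 0
    for digit in punches:
        value = value * 10 + digit
    return value

age = punch_field(25, 2)
print(age)                # [2, 5]: a hole at 2 in the left column, at 5 in the right
print(read_field(age))    # 25
print(punch_field(7, 2))  # [0, 7]
```

A field of two columns thus holds any number from 0 to 99, and a field of three columns any number from 0 to 999, as stated above.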

Data not in numerical form have to be coded into such form. 
In this work, the decimal scheme of classification should be em- 
ployed whenever possible. Figures in the same column should 
represent coordinate subdivisions; figures in columns to the
right, subordinate groups. A simple concrete case will serve to
illustrate the principle. 

1 Any given number of columns set aside for the representation of a single item in the
returns is usually referred to as a “field” on the card.

Suppose it is necessary to transfer to the punch card statistical
items referring to the different states of the union. Since there
are only 48 states it is entirely possible to assign the individual
states consecutive numbers from 1 to 48 and to represent any
one of the states with its appropriate code number in a two-column
field on the punch card. But it appears likely that the
customary groupings of the states (into New England, Central
Atlantic, South Atlantic, etc.) will be employed in tabulations
of the data. It is desirable, therefore, to have the code so set
up that the items relating to the states in these different groups may
be at any time segregated. With this object in view, it is best
to draw the two-column code so that numbers in the left-hand
column refer to the groups of states, and numbers in the right-hand
column to individual states within these groups. If, for
example, the New England States as a group are designated 1
in the tens column, the individual states within this group
(Maine, New Hampshire, Vermont, Massachusetts, Rhode Island,
and Connecticut) may be given the code numbers 11, 12, 13, 14,
15, 16. States in the Pacific Coast section may similarly be given
the numbers: Washington, 71; Oregon, 72; California, 73;
the group designation being 7 in the tens column. With this
scheme of coding, it is possible to segregate all the items relating
to any given geographic division by sorting the cards for just one
digit in a single column, the left-hand one of the two.
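The advantage of the grouped code may be seen in a short sketch. The cards and the `items` figures below are invented for illustration; the state codes are those given in the text (group digit in the tens column, state digit in the units column).

```python
# Two-column codes taken from the text: tens digit = group, units digit = state.
STATE_CODES = {
    "Maine": 11, "New Hampshire": 12, "Vermont": 13,
    "Washington": 71, "Oregon": 72, "California": 73,
}

# Hypothetical cards, each carrying a state code and one numerical item.
cards = [
    {"state_code": STATE_CODES["Oregon"], "items": 4},
    {"state_code": STATE_CODES["Maine"], "items": 9},
    {"state_code": STATE_CODES["California"], "items": 2},
    {"state_code": STATE_CODES["Vermont"], "items": 5},
]

def segregate(cards, group_digit):
    """Pick out the cards of one geographic division by the tens digit alone."""
    return [card for card in cards if card["state_code"] // 10 == group_digit]

new_england = segregate(cards, 1)
pacific = segregate(cards, 7)
print([card["state_code"] for card in new_england])  # [11, 13]
print([card["state_code"] for card in pacific])      # [72, 73]
```

Had the states been numbered consecutively from 1 to 48, the same segregation would require a list of the codes belonging to each division rather than a test on a single digit.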

The extent to which this principle may be carried is illustrated
in a three-column code designed to give a number to every port
recognized in the shipping world. The code in this case starts
with a subdivision of the globe into principal areas, or regions,
as follows: Atlantic North American, Pacific North American,
Middle American, South American, European, Mediterranean,
South and West African, Indian Ocean, East Asian, Australasian.
These principal regions are assigned numbers from 0 to 9 in the
left-hand of the three columns. The principal regions are in
turn subdivided into districts, which are represented in each
region by numbers running from 0 to 9 in the second column of
the code. The individual ports within each district are represented
by numbers running from 0 to 9 in the third column of
the code. On this scheme it is possible to indicate any port in
any part of the world by some number between 0 and 999.1
Furthermore, it is possible at any time to segregate the cases relating
to any particular major region by sorting the cards on the
basis of the digits in the left-hand column, and to get further
subdivisions of any particular region by subdividing the cards
of that region on the basis of the digits appearing in the second
column. The code thus serves two purposes: original transfer
of the data to the punch cards, and easy subsequent sorting
of the cards.

1 The details of the code are indicated in the following partial list of code numbers.
Where the printed columns of code numbers and port names cannot be matched line for line
in this copy, the code range of the district and the ports listed are given instead.

CODE OF PORTS

100 ATLANTIC NORTH AMERICAN REGION

110 Canadian District (including the province of Quebec and Newfoundland)
    110 All other ports
    111 Montreal, Quebec
    112 Quebec, Quebec
    113 Three Rivers, Quebec
    114 Tadousac, Quebec
    115
    116 All other St. Lawrence River ports
    117 St. Johns, Newfoundland
    118 Grand Bay (Port-aux-Basques), Newfoundland
    119

120 Great Lakes District
    120 All other ports
    121 Buffalo, N. Y.
    122 Cleveland, Ohio
    123 Detroit, Mich.
    124 Chicago, Ill.
    125 Duluth, Minn.
    126 Milwaukee, Wis.
    127 Toledo, Ohio
    128 Toronto, Ontario, Canada
    129

130 Maritime Provinces District (including New Brunswick, Nova Scotia, and
    Prince Edward Island, Canada)
    130 All other ports
    131 Halifax, N. S.
    132 Sydney, N. S.
    133 Louisburg, N. S.
    134 Pictou, N. S.
    135 Yarmouth, N. S.
    136 St. John, New Brunswick
    137 Chatham, New Brunswick
    138 Moncton, New Brunswick
    139 Charlottetown, P. E. Is.

140 Northern New England District (including Maine and New Hampshire)
    140 All other ports
    141 Portland, Maine
    142 Portsmouth, N. H.
    143 Bangor, Maine
    144 Bath, Maine
    145 Belfast (Searsport), Maine
    146 Rockland, Maine
    147

150 Southern New England District (including Massachusetts, Rhode Island,
    and Connecticut)
    Codes 150 to 159. Ports listed: All other ports; Boston, Mass.;
    New Bedford, Mass.; Fall River, Mass.; Newport, R. I.; Providence, R. I.;
    New London, Conn.; New Haven, Conn.; Bridgeport, Conn.

160 New York District (including the port of New York and associated ports
    in New Jersey)
    Codes 160 to 166. Ports listed: New York Harbor (not classified);
    Manhattan, N. Y.; Staten Island, N. Y.; Brooklyn, N. Y.; Hoboken, N. J.

170 Middle Atlantic District (including ports in New Jersey not contiguous
    to New York and the ports of Pennsylvania, Delaware, Maryland, and
    Virginia)
    Codes 170 to 177. Ports listed: All other ports; Philadelphia, Pa.
    (Camden and Chester); Norfolk, Va. (includes Portsmouth, Berkley, and
    Newport News, Va.); Baltimore, Md.; Wilmington, Del.

180 South Atlantic District (including North Carolina, South Carolina,
    Georgia, and eastern Florida and Bermuda Islands)
    Codes 180 to 189. Ports listed: All other ports; Wilmington, N. C.;
    Georgetown, S. C.; Charleston, S. C.; Savannah, Ga.; Brunswick, Ga.;
    Jacksonville, Fla.; Hamilton, Bermuda Is.; All other Bermuda Is. ports

190 Gulf District (including western Florida, Alabama, Mississippi,
    Louisiana, and Texas)
    190 All other ports
    191 New Orleans (Port Chalmette), La.
    192 Galveston (Pt. Bolivar, Freeport), Texas
    193 Sabine (Beaumont, Port Arthur), Texas
    194 Mobile, Ala.
    195 Pensacola, Fla.
    196 Tampa, Fla.
    197 Boca Grande, Fla.
    198 Key West, Fla.
    199 Gulfport, Miss.

200 PACIFIC NORTH AMERICAN REGION

210 Bering Sea and Alaska Peninsula District (including all Alaska west
    of Cook Inlet, Kodiak Island, and Aleutian Islands)
    Codes 210 to 219. Ports listed: All other ports; Nome, Alaska;
    St. Michael, Alaska; Other Bering Sea ports; Dutch Harbor (Unalaska);
    Other Aleutian Island ports; Kodiak Is.; Chignik, Alaskan Peninsula;
    Other Peninsula ports

220 Alaskan District (including all ports in Cook Inlet, Juneau District,
    and Alexander Archipelago)
    Codes 220 to 228. Ports listed: All other ports; Cordova, Alaska;
    Seward, Alaska; Juneau (including Douglas, Treadwell, Thane, and radius
    of 5 miles); Ketchikan, Alexander Archipelago; Skagway, Alaska;
    Sitka, Alaska

230 British Columbian District (including the entire Pacific coast of Canada)
    Codes 230 to 237. Ports listed: All other ports; Vancouver, British
    Columbia; New Westminster, British Columbia; Prince Rupert, British
    Columbia; Victoria (Vancouver Is.); Nanaimo (Vancouver Is.)

240 Puget Sound District (including northwestern Washington)
    Codes 240 to 249. Ports listed: All other ports; Seattle, Wash.;
    Tacoma, Wash.; Bellingham, Wash.; Everett, Wash.; Port Townsend, Wash.;
    Port Angeles, Wash.; Anacortes, Wash.

250 Oregonian District (including ports of southwestern Washington,
    Columbia River, and other ports of Oregon, and all ports of California
    north of San Francisco Bay)
    250 All other ports
    251 Portland, Ore.
    Codes 252 to 257. Ports listed: Astoria, Ore.; Gray’s Harbor (Aberdeen
    and Hoquiam, Wash.); Willapa Harbor (Southbend, Wash.); Eureka, Cal.

260 Central and Southern Californian District (including San Francisco
    Bay and all the coast of California to the south)
    Codes 260 to 267. Ports listed: All other ports; San Francisco
    (Oakland), Cal.; Los Angeles, Cal.; San Diego, Cal.; Port San Luis
    (San Luis Obispo Harbor); Monterey, Cal.

270 Hawaiian District (including the Hawaiian and Midway Islands)
    Codes 270 to 275. Ports listed: All other ports; Honolulu, Hawaiian Is.;
    Hilo, Hawaiian Is.; Midway Islands

Source: U. S. Shipping Board, Division of Operations, Tabulating Section, Office of the
Comptroller. First Revised Issue, May 1, 1919.

Application of the same principle in an entirely different field, namely, that of the
classification of accounts, is illustrated in the code which follows:

CODE OF ACCOUNTS

General Classification

Operating Expenses
    100 to 119  Deck Department: Wages
    120 to 139  Deck Department: Supplies
    140 to 149  Deck Department: Expenses
    200 to 219  Engine Department: Wages
    220 to 239  Engine Department: Supplies
    240 to 249  Engine Department: Expenses
    250 to 259  Fuel
    300 to 309  Steward’s Department: Wages
    320 to 339  Steward’s Department: Supplies
    340 to 349  Steward’s Department: Expenses
    350 to 359  Food
    450 to 499  Cargo Expenses
    550 to 599  Port Expenses
    650 to 699  Other Voyage Expenses
    750 to 799  General Expenses
    800 to 849  Maintenance
    850 to 859  Incidental Transportation Expenses
    860 to 864  Charter Expenses
    865 to 869  Suspense

General Ledger
    870 to 949  General Ledger Accounts

Operating Revenues
    950 to 999  Revenues

DETAIL CLASSIFICATION OF ACCOUNTS

Operating Expenses

Deck Department
    100 WAGES
    101 Officers
    102 Petty Officers
    103 Seamen
    104 Cadets
    105 Wireless Operators
    106 Messmen and Mess Boys
    107 Carpenters and Electricians
    108
    109
    111 Bonus
    112 Overtime
    113 Board Allowance
    114
    115

    120 SUPPLIES
    121 Nautical Instruments, Charts
    122
    123 Ropes: Wire and Manila
    124 Canvas
    125 Hardware and Tools
    126 Paints and Paint Brushes
    127 Cleaning Articles
    128 Flags, including Signal Flags
    129 Containers
    130
    131 Lamps: Electrical and Oil
    132 Oils for All Deck Uses
    133 Marlin Spikes, Shackles, Tackle Blocks and Falls
    134
    135
    138 Other Supplies

Engine Department
    220 SUPPLIES
    221 Lubricants, Oil, Grease, Graphite, Tallow
    222 Water
    223 Packing
    224
    225 Hardware, Tools, and Gauges
    226 Paints and Paint Brushes
    227 Cleaning Articles
    228 Chemicals
    229 Containers
    230
    231 Electrical and Oil Fixtures
    232 Waste and Curled Hair
    233
    234 Pipe Fittings: Valves, Couplings
    235
    236 Grate Bars
    237 Fire Bricks, Cement, and Clay
    238 Other Supplies
    239

    240 EXPENSES
    241 Scaling of Boilers
    242
    243 Removal of Ashes
    244
    245

    250 FUEL
    251 Coal
    252 Cost of Coaling
    253 Fuel Oil
    254 Fueling
    255 Cost of Steam
    256
    257
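The three-column port code lends itself to the same treatment as the two-column state code. The sketch below is a hedged illustration: the function names are hypothetical, and only a handful of code numbers quoted from the list above are used.

```python
# A few code numbers from the Code of Ports quoted in the footnote.
PORTS = {
    111: "Montreal, Quebec",
    121: "Buffalo, N. Y.",
    124: "Chicago, Ill.",
    131: "Halifax, N. S.",
    251: "Portland, Ore.",
}

def split_port_code(code):
    """Split a three-column code into (region, district, port) digits."""
    region, rest = divmod(code, 100)
    district, port = divmod(rest, 10)
    return region, district, port

def in_region(codes, region):
    """Segregate codes by the left-hand column only."""
    return [c for c in codes if split_port_code(c)[0] == region]

def in_district(codes, region, district):
    """Subdivide one region by the second column."""
    return [c for c in in_region(codes, region)
            if split_port_code(c)[1] == district]

print(split_port_code(124))         # (1, 2, 4)
print(in_region(PORTS, 1))          # [111, 121, 124, 131]
print(in_district(PORTS, 1, 2))     # [121, 124]
```

The first sort isolates the Atlantic North American region; the second isolates the Great Lakes District within it, without any reference to the names of the ports.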

Perhaps the most difficult problem to deal with in transfer- 
ring data to the punch card is the arrangement of fields. Since 
the number of columns on the card is limited, the number which 
can be allowed for any selection of data is commonly less than 
might be conveniently employed. It follows that certain de- 
tails in the returns have to be omitted from the punch card, and 
hence from subsequent tabulations. At times, it is very difficult, 
in advance of subsequent analysis, to determine just what in- 
formation to include on the card, and what to omit. 
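The arrangement of fields may be planned on paper before any card is punched. The following sketch assumes a card of 45 columns, as mentioned earlier; the field names and widths are hypothetical.

```python
CARD_COLUMNS = 45  # the text notes that as many as 45 columns may be placed on a card

def lay_out_fields(widths, card_columns=CARD_COLUMNS):
    """Assign consecutive columns to each field.

    `widths` maps a field name to the number of columns it needs.
    Returns a mapping of field name to (first column, last column),
    and raises ValueError if the fields do not fit on the card.
    """
    layout = {}
    next_column = 1
    for name, width in widths.items():
        layout[name] = (next_column, next_column + width - 1)
        next_column += width
    used = next_column - 1
    if used > card_columns:
        raise ValueError(f"fields need {used} columns; the card has {card_columns}")
    return layout

# Hypothetical selection of items for one card.
layout = lay_out_fields({"state": 2, "port": 3, "age": 2, "account": 3})
for name, (first, last) in layout.items():
    print(f"{name:<8} columns {first}-{last}")
```

When the planned fields exceed the columns available, some detail must be dropped or coded more compactly, which is the difficulty described in the paragraph above.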

Once the statistical data have been transferred to the punch
cards, and the perforations verified, tabulation of the data is
accomplished mechanically. The machines operate on this general
principle: when perforated cards are passed through the
machines, electric circuits are completed through the perforations,
the current thus passed being made to register on dials in the
machines.1 Before any counting is done, the cards are ordinarily
sorted on the sorting machine. With this machine, cards may be
put in consecutive order on the basis of the numbers punched in
any particular field, and cards bearing identical punches in any
particular field may be brought together. With the tabulating
machines it is then possible to count the number of cards in
any particular group, or to total the numbers represented in the
punches in any particular set of columns. Technical proficiency
in the operation of the machines consists, of course, in obtaining
the largest possible number of tabulations with the smallest
possible number of runs of the cards.

CODE OF ACCOUNTS (continued)

Steward’s Department
    300 WAGES
    301 Steward and Storekeeper
    302
    303 Cooks
    304
    305 Cabin Boys
    306 Messmen and Mess Boys
    307
    308
    311 Bonus
    312 Overtime
    313 Board Allowance
    314
    315

Source: U. S. Shipping Board, Division of Operations, Tabulating Section, Office of the
Comptroller

1 This principle was first perfected mechanically by Hollerith. Machines built on the
principle were first employed on a large scale in connection with the United States Census of
1890.
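The two operations of sorting and tabulating may be imitated in a brief sketch. It is not a description of any actual machine: cards are represented as dictionaries, and the field names `region` and `tons` and their values are invented for illustration.

```python
from itertools import groupby

def sort_cards(cards, field):
    """Bring together cards bearing identical punches in one field."""
    return sorted(cards, key=lambda card: card[field])

def tabulate(cards, group_field, total_field):
    """For each value punched in group_field, count the cards and
    total the numbers punched in total_field."""
    table = {}
    ordered = sort_cards(cards, group_field)
    for key, group in groupby(ordered, key=lambda card: card[group_field]):
        group = list(group)
        table[key] = (len(group), sum(card[total_field] for card in group))
    return table

# Hypothetical stack of five cards.
cards = [
    {"region": 1, "tons": 40},
    {"region": 2, "tons": 15},
    {"region": 1, "tons": 25},
    {"region": 2, "tons": 30},
    {"region": 1, "tons": 10},
]
print(tabulate(cards, "region", "tons"))  # {1: (3, 75), 2: (2, 45)}
```

Here a single run yields both the count and the total for every group, which is the economy of runs referred to at the close of the paragraph.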

In both sorting and tabulating machines, the cards are fed 
into the machine automatically. Sorts and counts may be made 
at surprising speed. Thus, the sorting machine will sort as many 
as 250 cards a minute; the tabulating machine will pass as many 
as 150 cards per minute. A very great saving of labor results. 
Transfer of statistical data to the punch card, however, requires 
considerable time and is hardly worth while unless the number 
of cases is large, or repeated tabulations are likely. But when 
either of these latter conditions prevails, mechanical tabulation 
represents an extraordinary saving; in fact, without mechanical 
tabulation many of the large-scale statistical enumerations of
the present time would be entirely impracticable, if not a physical
impossibility.

RULES FOR THE CONSTRUCTION OF STATISTICAL TABLES 2


THE progress of every art should be marked by the accumula- 
tion of an increasing stock of generally accepted practices. As 
these practices obtain common approval, they should be recog- 
nized as standard and regularly followed until more satisfactory 
methods are discovered. In statistical exposition, the standardi- 
zation of graphic methods has been one of the gratifying advances 
of recent years. The question may well be raised: To what extent
has there been, and to what extent are there further opportunities
for, a similar standardization of practice in the methods
of tabular presentation?

In considering this question, it should not be thought that 
standardization is accomplished only through the conscious adop- 
tion of rules and regulations set up by recognized organs of au- 
thority. Standardized statistical practices may evolve by im- 
perceptible degrees through the influences of imitation and pres- 
tige. This is particularly the case if some one statistical bureau 
is the fountain-head of governmental practice. The working 
rules of such an office tend to become the rules of a following of 
less influential practitioners. Standardization of this kind is 
going on at all times. Such standardization of practice as we 
have today in statistical work in this country is almost altogether 
the result of these influences of imitation and prestige. 

Unconscious standardization of this sort has already made 
substantial progress with regard to the structure of statistical 
tables. Without attempting a complete enumeration of the rules 

1 This appendix may be used with Chapter IV after Appendix C. 

2 Reprinted, with slight alterations, from the Quarterly Publication of the American 
Statistical Association, March, 1920. Reference may be made to another article appearing 
in the same periodical: G. P. Watkins, “Theory of Statistical Tabulation,” December, 1915, 
pp. 742-757. 

observed by competent authorities, a few of the standard prac- 
tices may be noted in passing. It is generally recognized that 


(1) every table should be self-sufficing, containing within itself 
a clear explanation of the meaning of the items displayed; 

(2) every table should be logically a unit containing only data 
which are intimately related with one another; 1 

(3) column- and row-headings should be brief, unambiguous, 
and self-explanatory, table footnotes being used when 
necessary to make the headings perfectly clear; 

(4) coordinate and subordinate relationships among the column- 
and row-headings should be shown by variations of 
boxing in the captions and of indentation in the stub; 

(5) varieties of letters, figures, lines, column-widths, and inter- 
linear spacings should be employed to facilitate easy and 
intelligent use of the table; 2 

(6) columns and rows should be lettered or numbered if cross- 
reference is desirable; 

(7) sources and units should invariably be indicated. 


The common acceptance of these principles represents no mean 
advance in the standardization of statistical table structure. 
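
Several of the practices just listed are mechanical enough to be sketched in a few lines of Python. The sketch below is only an illustration under stated assumptions: the function name `render_table`, its parameters, and the district figures are all invented here and do not come from the text. It shows a table that carries its own title, unit, and source, groups the digits of large numbers with commas, and leaves a space after every five rows.

```python
def render_table(title, unit, source, header, rows, group=5):
    """Return a plain-text table that follows a few of the rules above."""
    # Group the digits of large numbers with commas for quick reading.
    cells = [(name, f"{value:,}") for name, value in rows]
    stub_w = max([len(header[0])] + [len(name) for name, _ in cells])
    fig_w = max([len(header[1])] + [len(fig) for _, fig in cells])
    # A self-sufficing table states its own title and unit.
    lines = [title, f"(Unit: {unit})", ""]
    lines.append(f"{header[0]:<{stub_w}}  {header[1]:>{fig_w}}")
    for i, (name, fig) in enumerate(cells):
        # Leave a space after every `group` rows so long columns read easily.
        if i > 0 and i % group == 0:
            lines.append("")
        lines.append(f"{name:<{stub_w}}  {fig:>{fig_w}}")
    # The source is always indicated.
    lines += ["", f"Source: {source}"]
    return "\n".join(lines)


# Hypothetical figures, invented purely for illustration.
rows = [(f"District {i}", 10_986_521 - 1_000_003 * i) for i in range(7)]
example = render_table(
    title="Population by district (hypothetical)",
    unit="persons",
    source="invented for this sketch",
    header=("District", "Population"),
    rows=rows,
)
print(example)
```

The stub is left-aligned and the figures right-aligned so that digits of equal place value fall under one another; that alignment is a choice of this sketch rather than a rule stated above.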

It is to be observed, however, that the standardization thus 
far effected concerns primarily the constituent parts of the table, 
not the table’s general form. The choice of position between 
columns and rows, the arrangement of the several columns or the 
several rows, and the location of particular columns toward the 
left of the table or of particular rows toward the top, seem still 
to be matters of individual preference, if not of chance. It is 
important to consider how far standardization of the general 
form of statistical tables is feasible and desirable. 

Standardization of the general form of statistical tables must 
begin with a distinction between general-purpose and special-purpose 


1 Each order or level in the table, or in any section of the table, should exhibit one and 
only one complete classification, i.e. the captions or stub items within each section should 
be mutually exclusive. 

2 Long columns of figures are more easily read if spaces are left after every five or ten 
rows. Spaces or commas also may be employed to group the digits of large numbers (e.g. 
10,986,521) so as to promote quick and accurate reading. 

tables. The general-purpose table is designed to bring together 
in most convenient and accessible form all the data bearing upon 
a given topic. The special-purpose table is intended to throw 
into relief relationships of special significance in a given study. 
The general-purpose table is an orderly presentation of statis- 
tical material; the special-purpose table, a record of the results 
of statistical analysis. Of course, a measure of analysis is a pre- 
requisite even of the general-purpose table, but the analysis 
is of a different order. It is the analysis essential to effective 
enumeration and tabulation, not the analysis accompanying 
specific interpretation. The analysis required for the special- 
purpose table is directed toward a particular issue. The prob- 
lems of good table structure are essentially different for the two 
types of tables. 

Since the construction of the general-purpose table is the sim- 
pler case, it will be examined first. In considerable measure, 
the general-purpose, or primary, table is a creature of the physical 
form of the medium in which it appears. Upon the one hand, the 
table tends to expand to accommodate the large body of data 
pressing for inclusion. Upon the other hand, the capacity of 
the printed page — even if it be folio — stands as a limit on the 
indefinite enlargement of the table. Tables which are allowed 
to exceed the dimensions of the page and have to be folded in 
are everywhere recognized as objectionable. Loose tables, sepa- 
rately printed in large irregular sizes, are as bad, if not worse. 
Tables running across two pages facing one another are reason- 
ably satisfactory, but are to be avoided where possible. Tables 
which are presented at right angles to the text fall into the same 
class. In general, the single page, held as when reading the text, 
is the maximum size to which the statistical table should be per- 
mitted to run. Primary tables usually press upon this physical 
limit; their outside dimensions are thus independently determined. 

Within the table, similar influences are at work. Whether 
given arrays of data shall be exhibited in columns or in rows is com- 
monly a question of the difference in the vertical and horizontal ca- 
pacity of the page. The maximum number of lines in a table is 
several times greater than the maximum number of columns. 
Consequently the arrays having the greatest number of items are 
naturally assigned to the columns, the other arrays to the rows. 
Once a given set of headings has appeared in caption- or stub- 
position, there is a strong presumption in favor of its occupying 
the same position in other related tables, for the transcription 
of data from general tables is thereby facilitated. Upon the 
whole, however, the assignment of columns and rows rests funda- 
mentally upon the greater capacity of the column — a factor 
not subject to modification by the statistician. 

A much larger measure of option may be exercised in fixing, 
in a general-purpose table, the order of columns and of rows. Al- 
most any systematic plan may be adopted; but the most satis- 
factory arrangements are the alphabetical, chronological, geo- 
graphical, or according to the magnitude of the items. There 
are no grounds for urging the adoption of any one or two of these 
arrangements to the exclusion of the others. Now one best 
serves; now another. One rule, however, should govern the 
final selection in all cases: that order should be employed which 
keeps the details of the table most generally accessible. Readers 
will come to the table with a variety of interests. They should 
be given that table from which in general they can most easily 
draw the information they seek. Arrangement according to 
magnitude or importance of items is less satisfactory in general- 
purpose than in special-purpose tables, because it depends upon 
analysis from a single point of view, and it is frequently unwise 
to commit the table to this particular viewpoint. The other 
arrangements better meet the variety of needs which a primary 
table is designed to serve. The important end is to secure some 
logically and commonly understood arrangement which opens 
the table to easy transcription. 

When geographical or chronological orders are adopted, a 
decision has to be reached as to what items to place at the top 
and left, and what items at the bottom and right. In the tabular 
arrangement of the states of this country, the grouping and order 
followed by the Bureau of the Census may be recognized as stand- 
ard; the northern New England states stand at the head of the 
list, the southern Pacific states, at the foot. In general, the best 
statistical practice for this country would seem to be to run geo- 
graphical series from north to south and from east to west. With 
chronological series the case is not so clear. Upon the whole, 
however, for general-purpose tables, the Census Bureau practice 
of placing most recent dates at the top and left seems commend- 
able if there is a fair presumption that the figures of most recent 
date will be most frequently transcribed. When, however, the 
data probably will be transcribed in entirety as time series, it 
would seem preferable to place the figures for earlier dates toward 
the top and left. The rule to apply in all these cases is simple: 
the most generally useful data should be located toward the top 
and left where accurate transcription is rendered easier by close 
proximity to the column- and row-headings. 

The general or primary table exhibits no specific analysis. Its 
form is in considerable measure the resultant of the physical limi- 
tations of the page and the necessity of presenting a maximum 
body of data in a way that will make the most generally useful 
parts most readily accessible. The derived or analytical table 
is a different statistical device. A derived table is essentially 
deficient if it fails to exhibit a carefully formulated analysis. It 
should be constructed to assist a specific interpretation. Every 
effort should be made to make the table simple. It should con- 
tain only those items valuable to the analysis, arranged so as 
to encourage the deductions the reader is expected to draw. If 
any line is to be traced between statistical tabulation and statis- 
tical analysis, the primary table displays the results of tabula- 
tion, the derived table the results of analysis. 

Despite this fundamental distinction between primary and 
derived tables, it is to be admitted in the first place that the de- 
rived table is not altogether free from the influences of format which 
play so important a part in shaping the primary table. For 
example, if the number of subdivisions in one classification of an 
analysis is much greater than in the other, it may be necessary 
to put the more extended classification in the stub merely be- 
cause stub-capacity is normally so much greater than caption- 
capacity. Similarly, if the designations in one classification 
are much longer than in the other, it may be necessary to place 
the classification with longer headings in the stub, since neither 
of the alternatives — printing the longer headings vertically 
at the top of the columns, or widening the columns to accommo- 
date the longer headings horizontally —is at all satisfactory. 
Such crass considerations as these are at times decisive in deter- 
mining the structure even of the derived table. But they play 
a much less important part with the derived table than with 
the primary table. As a rule, the statistician is able to make 
the general form of the derived table serve the exposition in hand. 

One of the most fundamental questions of structure is the 
assignment of data to columns in some instances, to rows in others. 
This matter should be settled in the derived table with reference 
to what comparisons it is most important to present. Compari- 
son of like items in a column is much easier than of like items in 
a row. It is believed that recognition of this fact will commonly 
throw chronological, geographical, and quantitative classifica- 
tions into the stub, and qualitative classifications into the cap- 
tion; but this is not a necessary outcome. The important 
principle is to use the column position to promote the more sig- 
nificant comparison. 

Arrangement of the several columns and of the several rows 
in the derived table will be determined by the particular charac- 
ter of the analysis in connection with which the table is employed. 
If the analysis is of a temporal distribution, a chronological order 
will be adopted; if of a spatial distribution, a geographical order. 
If the items are component parts of an aggregate, arrangement 
will be either according to the relative magnitude or importance 
of the item, or according to some other order generally recognized 
in the analysis of the data in question. Presumably the alpha- 
betical arrangement will seldom be followed, since it does not 
directly disclose significant relationships. Ordinarily the pur- 
pose of the analysis will indicate clearly enough the order in which 
the columns or the rows should be placed. 

Naturally, the arrangement of columns and rows should give 
proper regard to the fact that the most conspicuous position in a 
statistical table is at the top and left. While it is generally true 
that derived tables are designed to bring out relationships rather 
than individual items and that these relationships are properties 
of the table as a whole rather than of particular parts, it may be 
desirable in some tables to focus attention especially upon cer- 
tain more important items. When other considerations will 
permit, these more important items should be placed in the most 
exposed positions of the table: namely, at the top and left next 
to the captions and stub. This rule is a sufficient warrant for 
placing totals at the top and left when they are clearly the most 
significant items of the tabulation, and when placing them at the 
top and left will not give serious offense to the users of the table. 
If either of these conditions is not present, it would seem pref- 
erable to place totals in the positions in which most readers ex- 
pect to find them, namely, at the bottom and right. There ap- 
pears to be no adequate reason for departing from the established 
practice of reading time from top to bottom and left to right. 
In derived tables, figures for later dates should appear toward 
the bottom and right. It is the relation between items, not the 
individual item, which is significant in time series. For many 
reasons we are accustomed to think of the upper or left-hand of 
two figures as the earlier, and we draw our conclusions accord- 
ingly. Furthermore, this rule is already thoroughly incorporated 
in our graphic practices. To have diametrically different rules 
for graphic and tabular presentation will be unfortunate. The 
Census Bureau practice of placing data for most recent dates 
at the top and left is therefore not to be approved for the derived 
table. Effective exposition of the statistical evidences is better 
served by the order which seems most natural to the great ma- 
jority of readers. Arrangements of columns and rows should 
hold fast to the purpose of facilitating interpretation. 

If the dominant purpose of the derived table be kept in mind, 
many problems of tabular arrangement will be readily solved. 
Percentage distributions will be placed next to the corresponding 
absolute figures or in a separate portion of the table, according 
to the emphasis of the analysis. To facilitate comparisons of 
relationship, the arrangements adopted in one table of an analy- 
sis will be followed as closely in the other tables as other more 
important considerations will permit. Columns and rows which 
are to be compared with one another will be brought as closely 
together as possible. Unnecessary digits will be dropped and 
items given in round numbers to simplify the presentation. 
The aim throughout will be to make the derived table an effec- 
tive instrument of statistical exposition. 
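
The arrangement described for a derived table can be illustrated with a short Python sketch. Everything named here is an assumption made for the illustration: the function `derived_table`, the `round_to` parameter, and the three component figures are hypothetical. The sketch orders the component parts of an aggregate by magnitude, places each percentage next to its absolute figure, drops unnecessary digits by rounding to thousands, and puts the total at the bottom.

```python
def derived_table(components, round_to=1000):
    """Order component parts by magnitude and attach percentage shares."""
    total = sum(components.values())
    # Arrange the parts of the aggregate by relative magnitude, largest first.
    ordered = sorted(components.items(), key=lambda kv: kv[1], reverse=True)
    rows = []
    for name, value in ordered:
        rounded = round(value / round_to)        # drop unnecessary digits
        share = round(100 * value / total, 1)    # percentage of the aggregate
        rows.append((name, rounded, share))
    # Totals go at the bottom, where most readers expect to find them.
    rows.append(("Total", round(total / round_to), 100.0))
    return rows


# Hypothetical component figures, invented purely for illustration.
hypothetical = {"A": 1_234_567, "B": 7_654_321, "C": 1_111_112}
for name, thousands, per_cent in derived_table(hypothetical):
    print(f"{name:<6} {thousands:>7,} {per_cent:>6.1f}")
```

Placing the total last follows the reader's expectation noted earlier; where the total is clearly the most significant item, the same rows could instead be emitted with the total first.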

If such are the considerations involved in the construction 
of statistical tables, what conclusions are to be drawn regarding 
the possibilities of standardization of table structure? Upon 
the whole, the opportunities for complete standardization seem 
slight except with regard to the elements from which the table 
is to be constructed, and certain lesser matters of general arrange- 
ment. More is to be gained at this time from a clear recogni- 
tion of important guiding principles in table construction. Care- 
ful attention must be paid to the difference of purpose in primary 
and derived tables. The primary table must be made to offer 
its items for easy transcription; the derived table, for ready 
deduction. If statistical tables are formed with nice regard for 
these fundamental aims of tabular presentation, standardization 
may well be allowed to proceed as it has heretofore, through 
imitation of the most satisfactory existing practices. Untiring 
experiment with varying forms and ready acceptance of improve- 
ments are for the present the most promising means of securing 
better construction of statistical tables. 

RULES FOR THE CONSTRUCTION OF STATISTICAL 
CHARTS 1 


GRAPHIC methods in statistics serve two distinct purposes. 
First, they are an effective means of presenting statistical results; 
second, they are a valuable aid in analyzing quantitative data. 
These two distinct purposes are to be recognized in formulating 
rules for the construction of statistical charts. 

It can hardly be said that any recognized rules govern the 
employment of graphic aids in statistical analysis. Any type of 
chart or diagram is permissible if it facilitates the analysis. The 
problem is to find the graphic form which best serves the particu- 
lar study in hand. No fixed standards would seem to apply. 

Graphic devices for presenting statistical results, on the other 
hand, conform increasingly to recognized conventions. Some 
of these relate to the representation of particular varieties of 
statistical material and are best dealt with elsewhere in connec- 
tion with the study of particular phases of statistical analysis. 
Other conventions have more general applicability. A brief 
examination of these may well precede the study of graphic forms 
for the representation of particular types of material. 

The first general requisite of a satisfactory chart is accuracy. 
By this is not meant mere accuracy of plotting or graphic con- 
struction but rather accuracy of general visual impression. 
Charts should convey a truthful image. Unfortunately, they 
not infrequently mislead in the visual impressions they give. 
Constant care must be exercised to keep them essentially honest 
in design and execution. Every step in the construction of a 
chart must be judged with reference to the truthfulness of the 
story the chart is likely to tell the readers for whom it is intended. 

The second requisite of satisfactory graphic representation 

1 This appendix may be used with Chapter V. 
is simplicity. It is the special function of the chart to convey an 
easy and effectual account of the data, a bird’s-eye view of the 
situation being presented. Interpretation of the chart should 
not require prolonged study. Care should therefore be taken 
not to complicate the figure; a diagram should be kept relatively 
simple. Charts fail much more frequently because they include 
too much than because they do not include enough. It is rarely 
desirable to show several curves in the same diagram. If the 
inclusion of letters or figures in the chart has the effect of robbing 
the diagram of its simplicity, the letters or figures should be placed 
in accompanying tables rather than inside the diagram. The 
chart should carry no unnecessary encumbrances. Its meaning 
should be clear upon brief examination. 

A third requisite of satisfactory charting is visual effectiveness. 
The diagram should catch and hold the attention of the reader. 
The chart is supposed to have advantages over the table and text 
because of its pictorial effect. This effect should be made the 
most of within the limits of truthfulness and simplicity. Ar- 
tistic effect and vividness are among the things to be most defi- 
nitely sought in every graphic exhibit. 1 


The interests of satisfactory graphic representation are pro- 
moted in no small measure by the adoption of standards of good 
practice. These standards may be used to bar inferior or ob- 
jectionable forms and to give wider and wider currency to forms 
that have special virtues. Furthermore, familiarity with stand- 
ard arrangements and devices promotes ready and accurate 
reading. Undoubtedly, standardization may go so far as to 
cramp the art of statistical graphics, but a moderate amount of 
standardization has a wholesome influence on graphic practices. 2 


1 This object is sometimes best achieved by resorting to color work. The objection to 
work in colors is that its reproduction is very expensive. Because of this consideration, if 
reproduction is contemplated the work should ordinarily be done entirely in black and 
white, different types of lining or cross-hatching being employed to obtain appropriate dis- 
tinctions within the figure. Some of the simple contrasts to be obtained in this way are 
illustrated in the conventional forms shown in Chart 46 (page 210). 

2 A good deal in the direction of standardization has already been accomplished, largely 
under the influence of Mr. Willard C. Brinton. Mr. Brinton’s excellent book, Graphic 
Methods for Presenting Facts, gives a useful list of rules to be observed in the construction 
of statistical charts. 

From the point of view of geometric form, standard statis- 
tical charts fall into three groups: 


(1) bar charts; 
(2) line charts; 
(3) statistical maps. 


The last are applicable only in the case of geographic exhibits 
and are considered at length in connection with the analysis of 
spatial distributions and series. Nothing need be added here 
to the explanation of statistical maps given in that connection. 
Bar and line charts, on the other hand, have a less restricted use 
and may be employed interchangeably in representing a number 
of forms of variation. An examination of conventional practice 
in the construction of bar and line charts therefore may be ap- 
propriately undertaken in this discussion of the more general 
rules of statistical graphics. 

Bar charts rest upon the general principle of representation 
of different magnitudes in the variable by parallel bars of cor- 
respondingly different lengths. The bars may be placed in either 
horizontal or vertical position. The bars of any single chart 
should be uniformly spaced and all of the same width. The 
width of the bars, as well as the spacing between them, should be 
so chosen as to give the chart the best possible appearance. The 
best effect is obtained if the spaces between the bars are made 
somewhat narrower than the bars themselves. The bars should 
rest upon a base line drawn at the left-hand edge of the chart in 
the case of horizontal bars and at the bottom of the chart in the 
case of vertical bars. The scale upon which the length of the 
bars is laid off should be indicated at right angles to the base 
line, preferably at the top in the case of horizontal bar charts 
and at the extreme left in the case of vertical bar charts. The 
scale should read from left to right for horizontal bars, and from 
bottom to top for vertical. The items in a horizontal bar chart 
should be read from top to bottom; in a vertical bar chart, from 
left to right. Light parallel lines may be drawn behind the bars 
at intervals of the scale to assist the eye in reading the chart. 
Bar diagrams in standard form appear in Charts 1 and 5 on pages 
53 and 61 above. 
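
The conventions for horizontal bars can be tried out with a rough text rendering in Python. This is a minimal sketch and not a substitute for a drawn chart: the function `horizontal_bars`, its parameters, and the three regional values are hypothetical, and a text rendering cannot show bar width or the spacing between bars. It does show bars of proportional length, all starting from a common base line at the left, with the items read from top to bottom.

```python
def horizontal_bars(items, width=40, mark="#"):
    """Draw horizontal bars whose lengths are proportional to the values."""
    largest = max(value for _, value in items)
    label_w = max(len(name) for name, _ in items)
    lines = []
    for name, value in items:  # items are read from top to bottom
        length = round(width * value / largest)
        # "|" is the base line at the left; every bar starts from it at zero.
        lines.append(f"{name:<{label_w}} |{mark * length} {value}")
    return "\n".join(lines)


# Hypothetical values, invented purely for illustration.
data = [("North", 40), ("South", 20), ("West", 10)]
chart = horizontal_bars(data)
print(chart)
```

Because every bar is measured from zero at the base line, a bar twice as long always stands for a value twice as large, which is the accuracy of visual impression asked of a chart.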

A line chart is ordinarily constructed on a system of coordinate 
lines. These lines constitute what is commonly called the grid. 
The lines of the grid parallel two axes, or reference lines, which 
are ordinarily drawn at right angles to each other, one horizon- 
tally and the other vertically. The horizontal axis is usually 
referred to as the x-axis, the vertical axis as the y-axis. Scales 
of the variables to be shown in the chart are given along these 
axes. The horizontal scale is laid off from left to right, the zero 
of the scale appearing sometimes, though not always, at the foot 
of the vertical axis; the vertical scale from bottom to top, the 
zero of the scale appearing usually at the intersection of the 
y-axis with the x-axis. The figure appearing on page 119 is a line 
chart in standard form. 

CHART 84. COORDINATE PLOTTING 

The line chart lends itself to effective representation of many 
types of data. On a system of coordinate lines constructed as 
shown in Chart 84, it is easy to plot the values of any two variables 
appearing in combination. Any point on the grid is to be 
referred to two separate scales: the scale on X and the scale on Y. 
Any point therefore represents a value of the X variable in com- 
bination with a value of the Y variable. Thus, P in Chart 84 
obviously represents X = 3, Y = 10. Any other point in the 
plot is to be similarly read. 

On the same principle a straight line across the grid repre- 
sents a succession of values of X and Y in combination. Thus, 
Q and R are just as much points on the line as P. At Q, 
X = 5, Y = 14; at R, X = 8, and Y = 20. In general, for 
points on this line, the relationship between X and Y is 
Y = 2X + 4. This simple algebraic equation thus represents 
the line shown in the chart. 1 Line charts constructed on coordinate 
grids serve admirably in the representation of many kinds 
of variables. 


1 The line MN is given by the expression Y = 30 - X. 
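
The reading of points against the equation can be checked with a short Python sketch. The function names `line_y` and `mn_y` are chosen here for illustration only. P is taken at X = 3, where the equation Y = 2X + 4 gives Y = 10; Q and R are as stated in the text. The crossing point of the two lines is not given in the text; it is worked out below simply as a further exercise in reading the grid.

```python
from fractions import Fraction


def line_y(x):
    """Y value on the line Y = 2X + 4."""
    return 2 * x + 4


def mn_y(x):
    """Y value on the line MN, Y = 30 - X."""
    return 30 - x


# P is assumed to lie at X = 3 on the line; Q and R are given in the text.
points = {"P": (3, 10), "Q": (5, 14), "R": (8, 20)}
on_line = {name: line_y(x) == y for name, (x, y) in points.items()}
print(on_line)

# Where the two lines cross: 2X + 4 = 30 - X, so 3X = 26.
x_cross = Fraction(26, 3)
y_cross = line_y(x_cross)
print(x_cross, y_cross, mn_y(x_cross) == y_cross)
```

Exact fractions are used for the crossing point so that the agreement of the two expressions is tested without rounding.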

Standards for graphic representation somewhat more specific 
than those relating to the geometric forms are to be recognized. 
An attempt to state such standards has been made in the report 
of a Joint Committee on Standards for Graphic Presentation, 
appointed by certain scientific associations and government 
offices. This report was first published in late 1915. (See Quar- 
terly Publication of the American Statistical Association, Decem- 
ber, 1915, pp. 790-797.) Though preliminary in form and never 


1 Seventeen representatives served on this Committee, Mr. Brinton, as representative 
of the American Society of Mechanical Engineers, acting as chairman. 
subsequently followed by any more definitive document, the report 
has had a decided influence on graphic practices. It is given in 
full on pages 415 to 419. 

Much of the art of effective graphic presentation is to be ac- 
quired only through practice. Artistic taste also plays a con- 
siderable part. The proportions in which the figure must be 
constructed to obtain the most satisfactory appearance is a point, 
for example, which cannot be covered by any hard-and-fast rule. 
Familiarity with various labor-saving devices, such as especially 
prepared papers and drafting instruments, is also highly desir- 
able. Careful adaptation of graphic forms to the different varie- 
ties of material should be kept constantly in mind. 2 With proper 
practice, a conscientious regard for the fundamental requisites 
of satisfactory charting, and a recognition of the standards of 
graphic presentation, the use of graphic forms may be made a 
most effective and desirable means of presenting statistical results. 


1 In general, diagrams have a more pleasing appearance if they are somewhat wider than 
they are high. If the ratio of width to height is made approximately three to two, results 
are usually satisfactory. 

2 It is because of the importance of this point that the explanation of graphic methods 
in the present text is given largely in connection with the different phases of analysis. 

APPENDIX G 


TABLE OF SQUARES, SQUARE ROOTS, AND RECIPROCALS, 1 TO 1000 
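
Any entry of a table of this kind can be recomputed, which is a convenient check on a doubtful figure. The Python sketch below is offered only as an illustration: the function `table_row` is a name invented here, and the rounding convention (half up, four places for roots, six for reciprocals) is an assumption inferred from the printed layout rather than stated in the text.

```python
from decimal import Decimal, ROUND_HALF_UP


def table_row(n):
    """Format one row: number, square, square root (4 places), reciprocal (6 places)."""
    number = Decimal(n)
    root = number.sqrt().quantize(Decimal("0.0001"), rounding=ROUND_HALF_UP)
    recip = (1 / number).quantize(Decimal("0.000001"), rounding=ROUND_HALF_UP)
    # The printed table omits the zero before the decimal point of a reciprocal.
    recip_text = str(recip).lstrip("0")
    return f"{n} {n * n} {root} {recip_text}"


for n in (1, 13, 31, 128):
    print(table_row(n))
```

Decimal arithmetic is used instead of binary floating point so that an exact tie, such as the reciprocal of 128 (.0078125), is rounded upward under the half-up rule assumed here.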


NUMBERS SQUARES SQ. ROOTS RECIPROCALS 
1 1 1.0000 1.000000 
2 4 1.4142 .500000 
3 9 1.7321 .333333 
4 16 2.0000 .250000 
5 25 2.2361 .200000 
6 36 2.4495 .166667 
7 49 2.6458 .142857 
8 64 2.8284 .125000 
9 81 3.0000 .111111 
10 100 3.1623 .100000 
11 121 3.3166 .090909 
12 144 3.4641 .083333 
13 169 3.6056 .076923 
14 196 3.7417 .071429 
15 225 3.8730 .066667 
16 256 4.0000 .062500 
17 289 4.1231 .058824 
18 324 4.2426 .055556 
19 361 4.3589 .052632 
20 400 4.4721 .050000 
21 441 4.5826 .047619 
22 484 4.6904 .045455 
23 529 4.7958 .043478 
24 576 4.8990 .041667 
25 625 5.0000 .040000 
26 676 5.0990 .038462 
27 729 5.1962 .037037 
28 784 5.2915 .035714 
29 841 5.3852 .034483 
30 900 5.4772 .033333 
31 961 5.5678 .032258 
32 1024 5.6569 .031250 
33 1089 5.7446 .030303 
34 1156 5.8310 .029412 
35 1225 5.9161 .028571 
36 1296 6.0000 .027778 
37 1369 6.0828 .027027 
38 1444 6.1644 .026316 
39 1521 6.2450 .025641 
40 1600 6.3246 .025000 
41 1681 6.4031 .024390 
42 1764 6.4807 .023810 
43 1849 6.5574 .023256 
44 1936 6.6332 .022727 
45 2025 6.7082 .022222 
46 2116 6.7823 .021739 
47 2209 6.8557 .021277 
48 2304 6.9282 .020833 
49 2401 7.0000 .020408 
50 2500 7.0711 .020000 
51 2601 7.1414 .019608 
52 2704 7.2111 .019231 
53 2809 7.2801 .018868 
54 2916 7.3485 .018519 
55 3025 7.4162 .018182 
56 3136 7.4833 .017857 
57 3249 7.5498 .017544 
58 3364 7.6158 .017241 
59 3481 7.6811 .016949 
60 3600 7.7460 .016667 
61 3721 7.8102 .016393 
62 3844 7.8740 .016129 
63 3969 7.9373 .015873 
64 4096 8.0000 .015625 
65 4225 8.0623 .015385 
66 4356 8.1240 .015152 
67 4489 8.1854 .014925 
68 4624 8.2462 .014706 
69 4761 8.3066 .014493 
70 4900 8.3666 .014286 
71 5041 8.4261 .014085 
72 5184 8.4853 .013889 
73 5329 8.5440 .013699 
74 5476 8.6023 .013514 
75 5625 8.6603 .013333 
76 5776 8.7178 .013158 
77 5929 8.7750 .012987 
78 6084 8.8318 .012821 
79 6241 8.8882 .012658 
80 6400 8.9443 .012500 
81 6561 9.0000 .012346 
82 6724 9.0554 .012195 
83 6889 9.1104 .012048 
84 7056 9.1652 .011905 
85 7225 9.2195 .011765 
86 7396 9.2736 .011628 
87 7569 9.3274 .011494 
88 7744 9.3808 .011364 
89 7921 9.4340 .011236 
90 8100 9.4868 .011111 
91 8281 9.5394 .010989 
92 8464 9.5917 .010870 
93 8649 9.6437 .010753 
94 8836 9.6954 .010638 
95 9025 9.7468 .010526 
96 9216 9.7980 .010417 
97 9409 9.8489 .010309 
98 9604 9.8995 .010204 
99 9801 9.9499 .010101 
100 10000 10.0000 .010000 
101 10201 10.0499 .009901 
102 10404 10.0995 .009804 
103 10609 10.1489 .009709 
104 10816 10.1980 .009615 
105 11025 10.2470 .009524 
106 11236 10.2956 .009434 
107 11449 10.3441 .009346 
108 11664 10.3923 .009259 
109 11881 10.4403 .009174 
110 12100 10.4881 .009091 
111 12321 10.5357 .009009 
112 12544 10.5830 .008929 
113 12769 10.6301 .008850 
114 12996 10.6771 .008772 
115 13225 10.7238 .008696 


116 13456 10.7703 .008621 
117 13689 10.8167 .008547 
118 13924 10.8628 .008475 
119 14161 10.9087 .008403 
120 14400 10.9545 .008333 
121 14641 11.0000 .008264 
122 14884 11.0454 .008197 
123 15129 11.0905 .008130 
124 15376 11.1355 .008065 
125 15625 11.1803 .008000 
126 15876 11.2250 .007937 
127 16129 11.2694 .007874 
128 16384 11.3137 .007813 
129 16641 11.3578 .007752 
130 16900 11.4018 .007692 
131 17161 11.4455 .007634 
132 17424 11.4891 .007576 
133 17689 11.5326 .007519 
134 17956 11.5758 .007463 
135 18225 11.6190 .007407 
136 18496 11.6619 .007353 
137 18769 11.7047 .007299 
138 19044 11.7473 .007246 
139 19321 11.7898 .007194 
140 19600 11.8322 .007143 
141 19881 11.8743 .007092 
142 20164 11.9164 .007042 
143 20449 11.9583 .006993 
144 20736 12.0000 .006944 
145 21025 12.0416 .006897 
146 21316 12.0830 .006849 
147 21609 12.1244 .006803 
148 21904 12.1655 .006757 
149 22201 12.2066 .006711 
150 22500 12.2474 .006667 
151 22801 12.2882 .006623 
152 23104 12.3288 .006579 
153 23409 12.3693 .006536 
154 23716 12.4097 .006494 
155 24025 12.4499 .006452 
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TABLE OF SQUARES, SQUARE Roots, AND RECIPROCALS — 
I TO 1000— Continued 


NUMBERS SQUARES Sg. Roots RECIPROCALS 
156 24336 12.4900 .006410
157 24649 12.5300 .006369
158 24964 12.5698 .006329
159 25281 12.6095 .006289
160 25600 12.6491 .006250
161 25921 12.6886 .006211
162 26244 12.7279 .006173
163 26569 12.7671 .006135
164 26896 12.8062 .006098
165 27225 12.8452 .006061
166 27556 12.8841 .006024
167 27889 12.9228 .005988
168 28224 12.9615 .005952
169 28561 13.0000 .005917
170 28900 13.0384 .005882
171 29241 13.0767 .005848
172 29584 13.1149 .005814
173 29929 13.1529 .005780
174 30276 13.1909 .005747
175 30625 13.2288 .005714
176 30976 13.2665 .005682
177 31329 13.3041 .005650
178 31684 13.3417 .005618
179 32041 13.3791 .005587
180 32400 13.4164 .005556
181 32761 13.4536 .005525
182 33124 13.4907 .005495
183 33489 13.5277 .005464
184 33856 13.5647 .005435
185 34225 13.6015 .005405
186 34596 13.6382 .005376
187 34969 13.6748 .005348
188 35344 13.7113 .005319
189 35721 13.7477 .005291
190 36100 13.7840 .005263
191 36481 13.8203 .005236
192 36864 13.8564 .005208
193 37249 13.8924 .005181
194 37636 13.9284 .005155
195 38025 13.9642 .005128
196 38416 14.0000 .005102
197 38809 14.0357 .005076
198 39204 14.0712 .005051
199 39601 14.1067 .005025
200 40000 14.1421 .005000
201 40401 14.1774 .004975
202 40804 14.2127 .004950
203 41209 14.2478 .004926
204 41616 14.2829 .004902
205 42025 14.3178 .004878
206 42436 14.3527 .004854
207 42849 14.3875 .004831
208 43264 14.4222 .004808
209 43681 14.4568 .004785
210 44100 14.4914 .004762
211 44521 14.5258 .004739
212 44944 14.5602 .004717
213 45369 14.5945 .004695
214 45796 14.6287 .004673
215 46225 14.6629 .004651
216 46656 14.6969 .004630
217 47089 14.7309 .004608
218 47524 14.7648 .004587
219 47961 14.7986 .004566
220 48400 14.8324 .004545
221 48841 14.8661 .004525
222 49284 14.8997 .004505
223 49729 14.9332 .004484
224 50176 14.9666 .004464
225 50625 15.0000 .004444
226 51076 15.0333 .004425
227 51529 15.0665 .004405
228 51984 15.0997 .004386
229 52441 15.1327 .004367
230 52900 15.1658 .004348
231 53361 15.1987 .004329
232 53824 15.2315 .004310
233 54289 15.2643 .004292
234 54756 15.2971 .004274
235 55225 15.3297 .004255
236 55696 15.3623 .004237
237 56169 15.3948 .004219 
238 56644 15.4272 .004202 
239 57121 15.4596 .004184 
240 57600 15.4919 .004167 
241 58081 15.5242 .004149 
242 58564 15.5563 .004132 
243 59049 15.5885 .004115
244 59536 15.6205 .004098
245 60025 15.6525 .004082 
246 60516 15.6844 .004065 
247 61009 15.7162 .004049 
248 61504 15.7480 .004032 
249 62001 15.7797 .004016 
250 62500 15.8114 .004000 
251 63001 15.8430 .003984 
252 63504 15.8745 .003968 
253 64009 15.9060 .003953
254 64516 15.9374 .003937
255 65025 15.9687 .003922
256 65536 16.0000 .003906 
257 66049 16.0312 .003891 
258 66564 16.0624 .003876 
259 67081 16.0935 .003861 
260 67600 16.1245 .003846 
261 68121 16.1555 .003831 
262 68644 16.1864 .003817 
263 69169 16.2173 .003802 
264 69696 16.2481 .003788 
265 70225 16.2788 .003774 
266 70756 16.3095 .003759
267 71289 16.3401 .003745 
268 71824 16.3707 .003731 
269 72361 16.4012 .003717
270 72900 16.4317 .003704
271 73441 16.4621 .003690 
272 73984 16.4924 .003676 
273 74529 16.5227 .003663 
274 75076 16.5529 .003650 


275 75625 16.5831 .003636
276 76176 16.6132 .003623
277 76729 16.6433 .003610
278 77284 16.6733 .003597
279 77841 16.7033 .003584
280 78400 16.7332 .003571
281 78961 16.7631 .003559
282 79524 16.7929 .003546
283 80089 16.8226 .003534
284 80656 16.8523 .003521
285 81225 16.8819 .003509
286 81796 16.9115 .003497
287 82369 16.9411 .003484
288 82944 16.9706 .003472
289 83521 17.0000 .003460
290 84100 17.0294 .003448
291 84681 17.0587 .003436
292 85264 17.0880 .003425
293 85849 17.1172 .003413
294 86436 17.1464 .003401
295 87025 17.1756 .003390
296 87616 17.2047 .003378
297 88209 17.2337 .003367
298 88804 17.2627 .003356
299 89401 17.2916 .003344
300 90000 17.3205 .003333
301 90601 17.3494 .003322
302 91204 17.3781 .003311
303 91809 17.4069 .003300
304 92416 17.4356 .003289
305 93025 17.4642 .003279
306 93636 17.4929 .003268
307 94249 17.5214 .003257
308 94864 17.5499 .003247
309 95481 17.5784 .003236
310 96100 17.6068 .003226
311 96721 17.6352 .003215
312 97344 17.6635 .003205
313 97969 17.6918 .003195
314 98596 17.7200 .003185
315 99225 17.7482 .003175
316 99856 17.7764 .003165
317 100489 17.8045 .003155
318 101124 17.8326 .003145
319 101761 17.8606 .003135
320 102400 17.8885 .003125
321 103041 17.9165 .003115
322 103684 17.9444 .003106
323 104329 17.9722 .003096
324 104976 18.0000 .003086
325 105625 18.0278 .003077
326 106276 18.0555 .003067
327 106929 18.0831 .003058
328 107584 18.1108 .003049
329 108241 18.1384 .003040
330 108900 18.1659 .003030
331 109561 18.1934 .003021
332 110224 18.2209 .003012
333 110889 18.2483 .003003
334 111556 18.2757 .002994
335 112225 18.3030 .002985
336 112896 18.3303 .002976
337 113569 18.3576 .002967
338 114244 18.3848 .002959
339 114921 18.4120 .002950
340 115600 18.4391 .002941
341 116281 18.4662 .002933
342 116964 18.4932 .002924
343 117649 18.5203 .002915
344 118336 18.5472 .002907
345 119025 18.5742 .002899
346 119716 18.6011 .002890
347 120409 18.6279 .002882
348 121104 18.6548 .002874
349 121801 18.6815 .002865
350 122500 18.7083 .002857
351 123201 18.7350 .002849
352 123904 18.7617 .002841
353 124609 18.7883 .002833
354 125316 18.8149 .002825
355 126025 18.8414 .002817


356 126736 18.8680 .002809
357 127449 18.8944 .002801
358 128164 18.9209 .002793
359 128881 18.9473 .002786
360 129600 18.9737 .002778
361 130321 19.0000 .002770
362 131044 19.0263 .002762
363 131769 19.0526 .002755
364 132496 19.0788 .002747
365 133225 19.1050 .002740
366 133956 19.1311 .002732
367 134689 19.1572 .002725
368 135424 19.1833 .002717
369 136161 19.2094 .002710
370 136900 19.2354 .002703
371 137641 19.2614 .002695
372 138384 19.2873 .002688
373 139129 19.3132 .002681
374 139876 19.3391 .002674
375 140625 19.3649 .002667
376 141376 19.3907 .002660
377 142129 19.4165 .002653
378 142884 19.4422 .002646
379 143641 19.4679 .002639
380 144400 19.4936 .002632
381 145161 19.5192 .002625
382 145924 19.5448 .002618
383 146689 19.5704 .002611
384 147456 19.5959 .002604
385 148225 19.6214 .002597
386 148996 19.6469 .002591
387 149769 19.6723 .002584
388 150544 19.6977 .002577
389 151321 19.7231 .002571
390 152100 19.7484 .002564
391 152881 19.7737 .002558
392 153664 19.7990 .002551
393 154449 19.8242 .002545
394 155236 19.8494 .002538
395 156025 19.8746 .002532
396 156816 19.8997 .002525 
397 157609 19.9249 .002519
398 158404 19.9499 .002513 
399 159201 19.9750 .002506 
400 160000 20.0000 .002500 
401 160801 20.0250 .002494 
402 161604 20.0499 .002488 
403 162409 20.0749 .002481 
404 163216 20.0998 .002475 
405 164025 20.1246 .002469 
406 164836 20.1494 .002463 
407 165649 20.1742 .002457 
408 166464 20.1990 .002451 
409 167281 20.2237 .002445 
410 168100 20.2485 .002439 
411 168921 20.2731 .002433
412 169744 20.2978 .002427 
413 170569 20.3224 .002421 
414 171396 20.3470 .002415 
415 172225 20.3715 .002410 
416 173056 20.3961 .002404 
417 173889 20.4206 .002398
418 174724 20.4450 .002392 
419 175561 20.4695 .002387
420 176400 20.4939 .002381 
421 177241 20.5182 .002375 
422 178084 20.5426 .002370 
423 178929 20.5670 .002364 
424 179776 20.5913 .002358 
425 180625 20.6155 .002353 
426 181476 20.6398 .002347 
427 182329 20.6640 .002342 
428 183184 20.6882 .002336 
429 184041 20.7123 .002331 
430 184900 20.7364 .002326
431 185761 20.7605 .002320 
432 186624 20.7846 .002315 
433 187489 20.8087 .002309 
434 188356 20.8327 .002304 


435 189225 20.8567 .002299
436 190096 20.8806 .002294
437 190969 20.9045 .002288
438 191844 20.9284 .002283
439 192721 20.9523 .002278
440 193600 20.9762 .002273
441 194481 21.0000 .002268
442 195364 21.0238 .002262
443 196249 21.0476 .002257
444 197136 21.0713 .002252
445 198025 21.0950 .002247
446 198916 21.1187 .002242
447 199809 21.1424 .002237
448 200704 21.1660 .002232
449 201601 21.1896 .002227
450 202500 21.2132 .002222
451 203401 21.2368 .002217
452 204304 21.2603 .002212
453 205209 21.2838 .002208
454 206116 21.3073 .002203
455 207025 21.3307 .002198
456 207936 21.3542 .002193
457 208849 21.3776 .002188
458 209764 21.4009 .002183
459 210681 21.4243 .002179
460 211600 21.4476 .002174
461 212521 21.4709 .002169
462 213444 21.4942 .002165
463 214369 21.5174 .002160
464 215296 21.5407 .002155
465 216225 21.5639 .002151
466 217156 21.5870 .002146
467 218089 21.6102 .002141
468 219024 21.6333 .002137
469 219961 21.6564 .002132
470 220900 21.6795 .002128
471 221841 21.7025 .002123
472 222784 21.7256 .002119
473 223729 21.7486 .002114
474 224676 21.7715 .002110
475 225625 21.7945 .002105
476 226576 21.8174 .002101
477 227529 21.8403 .002096 
478 228484 21.8632 .002092 
479 229441 21.8861 .002088 
480 230400 21.9089 .002083 
481 231361 21.9317 .002079 
482 232324 21.9545 .002075 
483 233289 21.9773 .002070 
484 234256 22.0000 .002066 
485 235225 22.0227 .002062 
486 236196 22.0454 .002058 
487 237169 22.0681 .002053
488 238144 22.0907 .002049
489 239121 22.1133 .002045
490 240100 22.1359 .002041
491 241081 22.1585 .002037 
492 242064 22.1811 .002033 
493 243049 22.2036 .002028
494 244036 22.2261 .002024 
495 245025 22.2486 .002020 
496 246016 22.2711 .002016 
497 247009 22.2935 .002012
498 248004 22.3159 .002008 
499 249001 22.3383 .002004 
500 250000 22.3607 .002000 
501 251001 22.3830 .001996 
502 252004 22.4054 .001992 
503 253009 22.4277 .001988
504 254016 22.4499 .001984 
505 255025 22.4722 .001980 
506 256036 22.4944 .001976
507 257049 22.5167 .001972 
508 258064 22.5389 .001969 
509 259081 22.5610 .001965 
510 260100 22.5832 .001961
511 261121 22.6053 .001957
512 262144 22.6274 .001953
513 263169 22.6495 .001949
514 264196 22.6716 .001946 


515 265225 22.6936 .001942
516 266256 22.7156 .001938
517 267289 22.7376 .001934
518 268324 22.7596 .001931
519 269361 22.7816 .001927
520 270400 22.8035 .001923
521 271441 22.8254 .001919
522 272484 22.8473 .001916
523 273529 22.8692 .001912
524 274576 22.8910 .001908
525 275625 22.9129 .001905
526 276676 22.9347 .001901
527 277729 22.9565 .001898
528 278784 22.9783 .001894
529 279841 23.0000 .001890
530 280900 23.0217 .001887
531 281961 23.0434 .001883
532 283024 23.0651 .001880
533 284089 23.0868 .001876
534 285156 23.1084 .001873
535 286225 23.1301 .001869
536 287296 23.1517 .001866
537 288369 23.1733 .001862
538 289444 23.1948 .001859
539 290521 23.2164 .001855
540 291600 23.2379 .001852
541 292681 23.2594 .001848
542 293764 23.2809 .001845
543 294849 23.3024 .001842
544 295936 23.3238 .001838
545 297025 23.3452 .001835
546 298116 23.3666 .001832
547 299209 23.3880 .001828
548 300304 23.4094 .001825
549 301401 23.4307 .001821
550 302500 23.4521 .001818
551 303601 23.4734 .001815
552 304704 23.4947 .001812
553 305809 23.5160 .001808
554 306916 23.5372 .001805
555 308025 23.5584 .001802
556 309136 23.5797 .001799
557 310249 23.6008 .001795
558 311364 23.6220 .001792
559 312481 23.6432 .001789
560 313600 23.6643 .001786
561 314721 23.6854 .001783
562 315844 23.7065 .001779
563 316969 23.7276 .001776
564 318096 23.7487 .001773
565 319225 23.7697 .001770
566 320356 23.7908 .001767
567 321489 23.8118 .001764
568 322624 23.8328 .001761
569 323761 23.8537 .001757
570 324900 23.8747 .001754
571 326041 23.8956 .001751
572 327184 23.9165 .001748
573 328329 23.9374 .001745
574 329476 23.9583 .001742
575 330625 23.9792 .001739
576 331776 24.0000 .001736
577 332929 24.0208 .001733
578 334084 24.0416 .001730
579 335241 24.0624 .001727
580 336400 24.0832 .001724
581 337561 24.1039 .001721
582 338724 24.1247 .001718
583 339889 24.1454 .001715
584 341056 24.1661 .001712
585 342225 24.1868 .001709
586 343396 24.2074 .001706
587 344569 24.2281 .001704
588 345744 24.2487 .001701
589 346921 24.2693 .001698
590 348100 24.2899 .001695
591 349281 24.3105 .001692
592 350464 24.3311 .001689
593 351649 24.3516 .001686
594 352836 24.3721 .001684
595 354025 24.3926 .001681
596 355216 24.4131 .001678
597 356409 24.4336 .001675
598 357604 24.4540 .001672
599 358801 24.4745 .001669
600 360000 24.4949 .001667
601 361201 24.5153 .001664
602 362404 24.5357 .001661
603 363609 24.5561 .001658
604 364816 24.5764 .001656
605 366025 24.5967 .001653
606 367236 24.6171 .001650
607 368449 24.6374 .001647
608 369664 24.6577 .001645
609 370881 24.6779 .001642
610 372100 24.6982 .001639
611 373321 24.7184 .001637
612 374544 24.7386 .001634
613 375769 24.7588 .001631
614 376996 24.7790 .001629
615 378225 24.7992 .001626
616 379456 24.8193 .001623
617 380689 24.8395 .001621
618 381924 24.8596 .001618
619 383161 24.8797 .001616
620 384400 24.8998 .001613
621 385641 24.9199 .001610
622 386884 24.9399 .001608
623 388129 24.9600 .001605
624 389376 24.9800 .001603
625 390625 25.0000 .001600
626 391876 25.0200 .001597
627 393129 25.0400 .001595
628 394384 25.0599 .001592
629 395641 25.0799 .001590
630 396900 25.0998 .001587
631 398161 25.1197 .001585
632 399424 25.1396 .001582
633 400689 25.1595 .001580
634 401956 25.1794 .001577
635 403225 25.1992 .001575
636 404496 25.2190 .001572
637 405769 25.2389 .001570
638 407044 25.2587 .001567
639 408321 25.2784 .001565
640 409600 25.2982 .001563
641 410881 25.3180 .001560
642 412164 25.3377 .001558
643 413449 25.3574 .001555
644 414736 25.3772 .001553
645 416025 25.3969 .001550
646 417316 25.4165 .001548
647 418609 25.4362 .001546
648 419904 25.4558 .001543
649 421201 25.4755 .001541
650 422500 25.4951 .001538
651 423801 25.5147 .001536
652 425104 25.5343 .001534
653 426409 25.5539 .001531
654 427716 25.5734 .001529
655 429025 25.5930 .001527
656 430336 25.6125 .001524
657 431649 25.6320 .001522
658 432964 25.6515 .001520
659 434281 25.6710 .001517
660 435600 25.6905 .001515
661 436921 25.7099 .001513
662 438244 25.7294 .001511
663 439569 25.7488 .001508
664 440896 25.7682 .001506
665 442225 25.7876 .001504
666 443556 25.8070 .001502
667 444889 25.8263 .001499
668 446224 25.8457 .001497
669 447561 25.8650 .001495
670 448900 25.8844 .001493
671 450241 25.9037 .001490
672 451584 25.9230 .001488
673 452929 25.9422 .001486
674 454276 25.9615 .001484
675 455625 25.9808 .001481
676 456976 26.0000 .001479
677 458329 26.0192 .001477
678 459684 26.0384 .001475
679 461041 26.0576 .001473
680 462400 26.0768 .001471
681 463761 26.0960 .001468
682 465124 26.1151 .001466
683 466489 26.1343 .001464
684 467856 26.1534 .001462
685 469225 26.1725 .001460
686 470596 26.1916 .001458
687 471969 26.2107 .001456
688 473344 26.2298 .001453
689 474721 26.2488 .001451
690 476100 26.2679 .001449
691 477481 26.2869 .001447
692 478864 26.3059 .001445
693 480249 26.3249 .001443
694 481636 26.3439 .001441
695 483025 26.3629 .001439
696 484416 26.3818 .001437
697 485809 26.4008 .001435
698 487204 26.4197 .001433
699 488601 26.4386 .001431
700 490000 26.4575 .001429
701 491401 26.4764 .001427
702 492804 26.4953 .001425
703 494209 26.5141 .001422
704 495616 26.5330 .001420
705 497025 26.5518 .001418
706 498436 26.5707 .001416
707 499849 26.5895 .001414
708 501264 26.6083 .001412
709 502681 26.6271 .001410
710 504100 26.6458 .001408
711 505521 26.6646 .001406
712 506944 26.6833 .001404
713 508369 26.7021 .001403
714 509796 26.7208 .001401
715 511225 26.7395 .001399
716 512656 26.7582 .001397
717 514089 26.7769 .001395
718 515524 26.7955 .001393
719 516961 26.8142 .001391
720 518400 26.8328 .001389
721 519841 26.8514 .001387
722 521284 26.8701 .001385
723 522729 26.8887 .001383
724 524176 26.9072 .001381
725 525625 26.9258 .001379
726 527076 26.9444 .001377
727 528529 26.9629 .001376
728 529984 26.9815 .001374
729 531441 27.0000 .001372
730 532900 27.0185 .001370
731 534361 27.0370 .001368
732 535824 27.0555 .001366
733 537289 27.0740 .001364
734 538756 27.0924 .001362
735 540225 27.1109 .001361
736 541696 27.1293 .001359
737 543169 27.1477 .001357
738 544644 27.1662 .001355
739 546121 27.1846 .001353
740 547600 27.2029 .001351
741 549081 27.2213 .001350
742 550564 27.2397 .001348
743 552049 27.2580 .001346
744 553536 27.2764 .001344
745 555025 27.2947 .001342
746 556516 27.3130 .001340
747 558009 27.3313 .001339
748 559504 27.3496 .001337
749 561001 27.3679 .001335
750 562500 27.3861 .001333
751 564001 27.4044 .001332
752 565504 27.4226 .001330
753 567009 27.4408 .001328
754 568516 27.4591 .001326
755 570025 27.4773 .001325
756 571536 27.4955 .001323
757 573049 27.5136 .001321
758 574564 27.5318 .001319
759 576081 27.5500 .001318
760 577600 27.5681 .001316
761 579121 27.5862 .001314
762 580644 27.6043 .001312
763 582169 27.6225 .001311
764 583696 27.6405 .001309
765 585225 27.6586 .001307
766 586756 27.6767 .001305
767 588289 27.6948 .001304
768 589824 27.7128 .001302
769 591361 27.7308 .001300
770 592900 27.7489 .001299
771 594441 27.7669 .001297
772 595984 27.7849 .001295
773 597529 27.8029 .001294
774 599076 27.8209 .001292
775 600625 27.8388 .001290
776 602176 27.8568 .001289
777 603729 27.8747 .001287
778 605284 27.8927 .001285
779 606841 27.9106 .001284
780 608400 27.9285 .001282
781 609961 27.9464 .001280
782 611524 27.9643 .001279
783 613089 27.9821 .001277
784 614656 28.0000 .001276
785 616225 28.0179 .001274
786 617796 28.0357 .001272
787 619369 28.0535 .001271
788 620944 28.0713 .001269
789 622521 28.0891 .001267
790 624100 28.1069 .001266
791 625681 28.1247 .001264
792 627264 28.1425 .001263
793 628849 28.1603 .001261
794 630436 28.1780 .001259
795 632025 28.1957 .001258
796 633616 28.2135 .001256
797 635209 28.2312 .001255
798 636804 28.2489 .001253
799 638401 28.2666 .001252
800 640000 28.2843 .001250
801 641601 28.3019 .001248
802 643204 28.3196 .001247
803 644809 28.3373 .001245
804 646416 28.3549 .001244
805 648025 28.3725 .001242
806 649636 28.3901 .001241
807 651249 28.4077 .001239
808 652864 28.4253 .001238
809 654481 28.4429 .001236
810 656100 28.4605 .001235
811 657721 28.4781 .001233
812 659344 28.4956 .001232
813 660969 28.5132 .001230
814 662596 28.5307 .001229
815 664225 28.5482 .001227
816 665856 28.5657 .001225
817 667489 28.5832 .001224
818 669124 28.6007 .001222
819 670761 28.6182 .001221
820 672400 28.6356 .001220
821 674041 28.6531 .001218
822 675684 28.6705 .001217
823 677329 28.6880 .001215
824 678976 28.7054 .001214
825 680625 28.7228 .001212
826 682276 28.7402 .001211
827 683929 28.7576 .001209
828 685584 28.7750 .001208
829 687241 28.7924 .001206
830 688900 28.8097 .001205
831 690561 28.8271 .001203
832 692224 28.8444 .001202
833 693889 28.8617 .001200
834 695556 28.8791 .001199
835 697225 28.8964 .001198
836 698896 28.9137 .001196
837 700569 28.9310 .001195
838 702244 28.9482 .001193
839 703921 28.9655 .001192
840 705600 28.9828 .001190
841 707281 29.0000 .001189
842 708964 29.0172 .001188
843 710649 29.0345 .001186
844 712336 29.0517 .001185
845 714025 29.0689 .001183
846 715716 29.0861 .001182
847 717409 29.1033 .001181
848 719104 29.1204 .001179
849 720801 29.1376 .001178
850 722500 29.1548 .001176
851 724201 29.1719 .001175
852 725904 29.1890 .001174
853 727609 29.2062 .001172
854 729316 29.2233 .001171
855 731025 29.2404 .001170
856 732736 29.2575 .001168
857 734449 29.2746 .001167
858 736164 29.2916 .001166
859 737881 29.3087 .001164
860 739600 29.3258 .001163
861 741321 29.3428 .001161
862 743044 29.3598 .001160
863 744769 29.3769 .001159
864 746496 29.3939 .001157
865 748225 29.4109 .001156
866 749956 29.4279 .001155
867 751689 29.4449 .001153
868 753424 29.4618 .001152
869 755161 29.4788 .001151
870 756900 29.4958 .001149
871 758641 29.5127 .001148
872 760384 29.5296 .001147
873 762129 29.5466 .001145
874 763876 29.5635 .001144
875 765625 29.5804 .001143
876 767376 29.5973 .001142
877 769129 29.6142 .001140
878 770884 29.6311 .001139
879 772641 29.6479 .001138
880 774400 29.6648 .001136
881 776161 29.6816 .001135
882 777924 29.6985 .001134
883 779689 29.7153 .001133
884 781456 29.7321 .001131
885 783225 29.7489 .001130
886 784996 29.7658 .001129
887 786769 29.7825 .001127
888 788544 29.7993 .001126
889 790321 29.8161 .001125
890 792100 29.8329 .001124
891 793881 29.8496 .001122
892 795664 29.8664 .001121
893 797449 29.8831 .001120
894 799236 29.8998 .001119
895 801025 29.9166 .001117
896 802816 29.9333 .001116
897 804609 29.9500 .001115
898 806404 29.9666 .001114
899 808201 29.9833 .001112
900 810000 30.0000 .001111
901 811801 30.0167 .001110
902 813604 30.0333 .001109
903 815409 30.0500 .001107
904 817216 30.0666 .001106
905 819025 30.0832 .001105
906 820836 30.0998 .001104
907 822649 30.1164 .001103
908 824464 30.1330 .001101
909 826281 30.1496 .001100
910 828100 30.1662 .001099
911 829921 30.1828 .001098
912 831744 30.1993 .001096
913 833569 30.2159 .001095
914 835396 30.2324 .001094
915 837225 30.2490 .001093
916 839056 30.2655 .001092
917 840889 30.2820 .001091
918 842724 30.2985 .001089
919 844561 30.3150 .001088
920 846400 30.3315 .001087
921 848241 30.3480 .001086
922 850084 30.3645 .001085
923 851929 30.3809 .001083
924 853776 30.3974 .001082
925 855625 30.4138 .001081
926 857476 30.4302 .001080
927 859329 30.4467 .001079
928 861184 30.4631 .001078
929 863041 30.4795 .001076
930 864900 30.4959 .001075
931 866761 30.5123 .001074
932 868624 30.5287 .001073
933 870489 30.5450 .001072
934 872356 30.5614 .001071
935 874225 30.5778 .001070
936 876096 30.5941 .001068
937 877969 30.6105 .001067
938 879844 30.6268 .001066
939 881721 30.6431 .001065
940 883600 30.6594 .001064
941 885481 30.6757 .001063
942 887364 30.6920 .001062
943 889249 30.7083 .001060
944 891136 30.7246 .001059
945 893025 30.7409 .001058
946 894916 30.7571 .001057
947 896809 30.7734 .001056
948 898704 30.7896 .001055
949 900601 30.8058 .001054
950 902500 30.8221 .001053
951 904401 30.8383 .001052
952 906304 30.8545 .001050
953 908209 30.8707 .001049
954 910116 30.8869 .001048
955 912025 30.9031 .001047
956 913936 30.9192 .001046
957 915849 30.9354 .001045
958 917764 30.9516 .001044
959 919681 30.9677 .001043
960 921600 30.9839 .001042
961 923521 31.0000 .001041
962 925444 31.0161 .001040
963 927369 31.0322 .001038
964 929296 31.0483 .001037
965 931225 31.0644 .001036
966 933156 31.0805 .001035
967 935089 31.0966 .001034
968 937024 31.1127 .001033
969 938961 31.1288 .001032
970 940900 31.1448 .001031
971 942841 31.1609 .001030
972 944784 31.1769 .001029
973 946729 31.1929 .001028
974 948676 31.2090 .001027
975 950625 31.2250 .001026
976 952576 31.2410 .001025
977 954529 31.2570 .001024
978 956484 31.2730 .001022
979 958441 31.2890 .001021
980 960400 31.3050 .001020
981 962361 31.3209 .001019
982 964324 31.3369 .001018
983 966289 31.3528 .001017
984 968256 31.3688 .001016
985 970225 31.3847 .001015
986 972196 31.4006 .001014
987 974169 31.4166 .001013
988 976144 31.4325 .001012
989 978121 31.4484 .001011
990 980100 31.4643 .001010
991 982081 31.4802 .001009
992 984064 31.4960 .001008
993 986049 31.5119 .001007
994 988036 31.5278 .001006
995 990025 31.5436 .001005
996 992016 31.5595 .001004
997 994009 31.5753 .001003
998 996004 31.5911 .001002
999 998001 31.6070 .001001
1000 1000000 31.6228 .001000
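
Any entry in the foregoing table can be checked by direct computation. The following is a minimal sketch in Python, not part of the original text: the function name table_row is hypothetical, and the rounding rule (half up, four places for roots and six for reciprocals) is an assumption inferred from the printed entries.

```python
from decimal import Decimal, ROUND_HALF_UP


def table_row(n: int) -> str:
    """Return one table row: number, square, square root, reciprocal.

    Assumption: roots are rounded half up to 4 places and reciprocals
    to 6 places, which is what the printed entries appear to follow.
    """
    number = Decimal(n)
    root = number.sqrt().quantize(Decimal("0.0001"), rounding=ROUND_HALF_UP)
    reciprocal = (Decimal(1) / number).quantize(
        Decimal("0.000001"), rounding=ROUND_HALF_UP
    )
    # The printed table drops the leading zero of the reciprocal.
    reciprocal_text = str(reciprocal)
    if reciprocal_text.startswith("0."):
        reciprocal_text = reciprocal_text[1:]
    return f"{n} {n * n} {root} {reciprocal_text}"


if __name__ == "__main__":
    for n in (100, 128, 1000):
        print(table_row(n))
```

Decimal arithmetic is used here so that a reciprocal such as 1/128 = .0078125 is rounded up to .007813, as in the table, rather than to the nearest even digit.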


APPENDIX H 


RELATIONSHIP BETWEEN r AND ρ


 ρ     r       ρ     r       ρ     r       ρ     r
.01  .0105    .26  .2714    .51  .5277    .76  .7750
.02  .0209    .27  .2818    .52  .5378    .77  .7847
.03  .0314    .28  .2922    .53  .5479    .78  .7943
.04  .0419    .29  .3025    .54  .5580    .79  .8039
.05  .0524    .30  .3129    .55  .5680    .80  .8135
.06  .0628    .31  .3232    .56  .5781    .81  .8230
.07  .0733    .32  .3335    .57  .5881    .82  .8325
.08  .0838    .33  .3439    .58  .5981    .83  .8421
.09  .0942    .34  .3542    .59  .6081    .84  .8516
.10  .1047    .35  .3645    .60  .6180    .85  .8610
.11  .1151    .36  .3748    .61  .6280    .86  .8705
.12  .1256    .37  .3850    .62  .6379    .87  .8799
.13  .1360    .38  .3953    .63  .6478    .88  .8893
.14  .1465    .39  .4056    .64  .6577    .89  .8986
.15  .1569    .40  .4158    .65  .6676    .90  .9080
.16  .1674    .41  .4261    .66  .6775    .91  .9173
.17  .1778    .42  .4363    .67  .6873    .92  .9266
.18  .1882    .43  .4465    .68  .6971    .93  .9359
.19  .1986    .44  .4567    .69  .7069    .94  .9451
.20  .2091    .45  .4669    .70  .7167    .95  .9543
.21  .2195    .46  .4771    .71  .7265    .96  .9635
.22  .2299    .47  .4872    .72  .7363    .97  .9727
.23  .2403    .48  .4973    .73  .7460    .98  .9818
.24  .2507    .49  .5075    .74  .7557    .99  .9909
.25  .2611    .50  .5176    .75  .7654   1.00 1.0000
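
The entries of this table agree with the relation r = 2 sin(πρ/6), the conversion usually attributed to Karl Pearson for passing from the rank coefficient ρ to r when the variables are normally distributed. The appendix does not state the formula at this point, so the sketch below rests on that assumption; the function name r_from_rho is hypothetical.

```python
import math


def r_from_rho(rho: float) -> float:
    """Convert a rank correlation rho to r, assuming r = 2 sin(pi * rho / 6)."""
    return 2.0 * math.sin(math.pi * rho / 6.0)


if __name__ == "__main__":
    # Print a few values for comparison with the table.
    for rho in (0.01, 0.26, 0.50, 0.75, 1.00):
        print(f"{rho:.2f} {r_from_rho(rho):.4f}")
```

The correction is small throughout: r exceeds ρ by at most about .018, and the two coincide at 0 and at 1.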
APPENDIX I


LIST OF GENERAL REFERENCES


Bailey, W. B., and Cummings, J., Statistics, A. C. McClurg,
Chicago, 1917

Bowley, A. L., Elements of Statistics, 4th ed., Scribner, N. Y.,
1920

An Elementary Manual of Statistics, Macdonald & Evans,
London, 1910

The Measurement of Groups and Series, Macdonald & Evans,
London, 1903

Brinton, W. C., Graphic Methods for Presenting Facts,
Engineering Magazine Co., N. Y., 1914

Chaddock, R. E., Principles and Methods of Statistics, Houghton
Mifflin, Boston, 1925

Chapin, F. S., Field Work and Social Research, Century, N. Y.,
1920

Davenport, C. B., Statistical Methods, with Special Reference
to Biological Variation, 3d ed., Wiley, N. Y., 1914

Davies, G. R., Introduction to Economic Statistics, Century,
N. Y., 1922

Elderton, W. P., Frequency Curves and Correlation, C. & E.
Layton, London, 1906

Elderton, W. P. and E. M., Primer of Statistics, Macmillan,


NPN erer Or? 
Fisher, A., The Mathematical Theory of Probabilities, Macmillan,
N. Y., 1922

An Elementary Treatise on Frequency Curves, Macmillan,
N. Y., 1922

Forsyth, C. H., Mathematical Analysis of Statistics, Wiley,
N. Y., 1924

Haskell, A. C., How to Make and Use Graphic Charts, Codex
Book Co., N. Y., 1919

Jones, D. C., A First Course in Statistics, G. Bell & Sons, Lon- 
don, 1921 

KarstTEN, K. G., Charts and Graphs, Prentice-Hall, N. Y., 1923 

Kettey, T. L., Statistical Method, Macmillan, N. Y., 1923 

KEnt, F. C., Elements of Statistics, McGraw-Hill, N. Y., 1924 
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Kinc, W. I., Elements of Statistical Method, Macmillan, N. Y., 
1920 

Meee, W. C., Graphical Methods, McGraw-Hill, N. Y., 
I9g21 

Tea F. C., Statistical Methods — Applied to Economics and 
Business, Holt, N. Y., 1924 

PEARL, R., Medical Biometry and Statistics, W. B. Saunders Co., 
Philadelphia, 1923 

RIEGEL, R., Elements of Business Statistics, Appleton, N. Y., 
1924 

Rietz, H. L., Editor-in-Chief, Handbook of Mathematical Statis- 
tics, Houghton Mifflin, Boston, 1924 

Rucc, H. O., Statistical Methods A pplied to Education, Houghton 
Mifflin, Boston, 1917 

Secrist, H., An Introduction to Statistical Methods, revised edi- 
tion, Macmillan, N. Y., 1925 

Readings and Problems in Statistical Methods, Macmillan, 

IN2 Ys 1020 

THORNDIKE, E. L., An Introduction to the Theory of Mental and 
Social Measurements, Columbia Univ., N. Y., 1913 

WELD, L. D., Theory of Errors and Least Squares, Macmillan,
N. Y., 1916

WEST, C. J., Introduction to Mathematical Statistics, R. G. Adams,
Columbus, 1918

YOUNG, B. F., Statistics as Applied in Business, Ronald Press,
N. Y., 1925

YULE, G. U., An Introduction to the Theory of Statistics, 7th
edition, C. Griffin & Co., London, 1924

ZIZEK, F., Statistical Averages. Translated by W. M. Persons.
Holt, N. Y., 1913


The foregoing is a list of the most serviceable general refer- 
ences in English. Among the most useful treatises on statistical 
method in French and German are: 


CZUBER, E., Die statistischen Forschungsmethoden, L. W. Seidel &
Sohn, Wien, 1921

JULIN, A., Principes de Statistique Théorique et Appliquée, Riviere,
Bruxelles, 1921

KAUFMANN, A., Theorie und Methoden der Statistik, J. C. B.
Mohr, Tübingen, 1913

MAYR, G. von, Statistik und Gesellschaftslehre, vol. I, J. C. B.
Mohr, Tübingen, 1914


INDEX 


Accidental fluctuations (in time series): measurement of, 311;
nature of, 241, 310-312.

Accuracy: factors affecting, 33-34; of statistical data, 377-378;
prescribed, 33.

Adequacy: of statistical data, 376-377.

Arithmetic mean: bias of, in averaging of relatives, 345-346;
calculation of, 136-139; definition of, 135; presumption
in favor of, 155, 157, 158; use of, 160.

Array: nature of, 64; use of, 64.

Axes: 75, 100, 414.

Ayres, L. P.: 106, 197, 322.

Asymmetry: as affecting the use of averages,
158; common measures of, 176-177;
Pearsonian measure of, 177; positive vs.
negative, 178; quartile measure of, 177;
relative merits of measures of, 178-179;
significance of, 175-176.

Average deviation: definition of, 165; 
virtues and defects of, as measure of 


variability, 173; computation of, 165- 
TO 
Averages: abstract vs. typical, 156-157;
abuse of, 159; characteristics of good,
153-154; five common forms of, 134; general
nature of, 134; limited significance of,
159-160; relative merits of five common
forms of, 154-155; symbols for five common
forms of, 135; use of abstract, 156, 160-162;
use of typical, 156-158.


Bailey, W. B.: 453. 

Bar charts: general form of, 53-57; horizontal
vs. vertical forms of, 53-54, 59-60;
standard form of, 413-414; use of, 59-61,
109-110.

Base (in index-number construction): 334, 
340, 350- 

Base-reversal test: 345-347. 

Berridge, W. A.: 322. 

Bias: in averages of relatives, 345-348; in
use of weights, 365-366; of informants,
33-34.

Binomial expansion: 122. 

Blackett, O. W.: 264. 

Bowley, A. L.: 150, 153, 180, 281, 287, 313, 
338, 339, 359, 379, 387, 452. 

Brinton, W. C.: 100, 218, 412, 453.


Chaddock, R. E.: 453. 

Chain relatives: 291-296.

Chapin, F. S.: 453.

Charlier, C. V. L.: 190.

Clark, Earle: 75. 

Clark, Wallace: 116. 

Classifications: attributive, 37-38, 40-42;
by degree, 40; by kind, 40; contrasted
with series, 46-47; forms of, 37-42;
geographic, 37, 39; nature of, 36; qualitative
vs. quantitative, 40; simple vs.
multiple, 41-42; temporal, 37, 39.

Classifications by kind: formulation of, 50-51;
graphic representation of, 53-57;
nature of, 48-50; statistical treatment of,
51-53.

Classifications of degree: graphic representation
of, 59-61; nature of, 57; qualitative
vs. quantitative, 57-59; statistical treat- 
ment of qualitative, 59-61. 

Class intervals: in cumulative frequency 
distributions, 82; in simple frequency dis- 
tributions, 65-70; in spatial distributions, 
93-95; in temporal distributions, 105-107. 

Class limits: in simple frequency distribu- 
tions, 71-72; in spatial distributions, 95; 
in temporal distributions, 108. 

Class designations: in cumulative frequency 
distributions, 82; in simple frequency dis- 
tributions, 73-74; in spatial distributions, 
95; in temporal distributions, 108. 

Coefficient of correlation: see Correlation 
coefficient. 

Coefficient of regression: see Regression
coefficient.

Collection: “cultivation” of field in, 388;
editing of returns in, 388; forms (or
schedules) for, 386-388; methods of, 384-386;
scope of, 383-384; through correspondence
method, 384, 385-386; through
field method, 384, 385.

Column diagram: construction of, 75-77;
use of, 77-80.

Compound-interest curve: fitting of, to
empirical data, 274-275; form of, 265.

Continuous variables: 15-17.

Coordinates: 414.

Copeland, M. T.: 287.

Correlation: appearance of, in time series,
313-322; as shown by means of columns
or rows, 188; as shown by scatter diagram,
181-182; as shown in correlation
table, 182-187; concepts underlying
measurement of, 191-193; interpretation
of measures or coefficients of, 209-210;
measurement of, from rank data, 206-209;
measurement of, by Pearsonian coefficient,
193-199; measurement of, by correlation
ratio, 201-206; measurement of,
in time series, 317-322; nature of, 188-189;
necessity of pairing of items in, 180-181;
significance of regression lines in,
199-200; two distinct phases of, 189-190.


Correlation coefficient: calculation of, 194- 
199, 318-322, 323; definition of, 193. 

Correlation ratio: calculation of, 204-206; 
definition of, 203. 

Correlation table: construction of, 183-186;
form of, indicative of correlation, 186-187;
nature of, 182-183.

Correspondence (in time series): nature of,
313-317; measurement of, 315-317.

Crathorne, A. R.: 198, 205.

Crum, W. L.: 201, 300.

Cummings, J.: 453. 

“Cycle” series: 310. 

Cyclical fluctuations: differences of amplitude
in, 308-310; differences of duration
in, 307-308; nature of, 238-240, 306-307;
smoothing of, 307.

Curve of error: 123-124, 311. 

Czuber, E.: 454. 


Davenport, C. B.: 453. 

Davies, G. R.: 87, 453. 

Dependent variable: 42-43. 

Deciles: as measures of variability, 171; 
definition of, 171. 

Deviation: see Variability. 

Discrete variables: 15-17. 

Dispersion: see Variability and Asymmetry. 

Distributions: 48. 

Distributive series: 48. 

Dot maps: comparable scales in, 223-224;
construction of, 98-102; general nature,
98-100, 101-102.


Edgeworth, F. Y.: 348. 

Elderton, E. M.: 453. 

Elderton, W. P.: 453.

Episodic movements: measures of, 304-306; 
nature of, 240-241, 302-304. 

Error: curve of, 122-123; sources of, 33-34;
probable, 378-379; types of, 378.


Factor-reversal test: 359-360.
Falkner, Helen D.: 206. 
Field, J. A.: 251. 

Fisher, A.: 453. 

Fisher, Irving: 251, 328, 343, 345, 350, 362. 

Flux, A. J.: 350. 

Forsyth, C. H.: 453. 

Fortuitous movements: see Accidental fluc- 
tuations. 

Frequency curve: construction of, 75-77; 
use of, 77-80; normal type of, 123-124. 

Frequency distributions: classification of,
118-131; differences among, 131-133;
extreme asymmetric type illustrated, 127;
J-shaped, illustrated and explained, 127-129;
moderately asymmetric type illustrated,
126-127; multimodal forms of,
130-132; nature of, 62; normal form of,
122-124; relation between symmetric
and asymmetric forms, 124-125; symmetric
type by binomial expansion, 122;
symmetric type illustrated, 118-120; symmetric
type from coin tossing, 121-122;
symmetric type from random errors, 121;
value of, 63-64, 74-75.

Frequency distributions (cumulative): for- 
mulation of, 82; graphic representation of, 
82-89; nature of, 80-82. 

Frequency distributions (simple): class des- 
ignations in, 73-74; class intervals in, 
65-70; class limits of, 70-72; graphic 
representation of, 75-80; location of 
classes in, 70-71; on logarithmic scale, 
67-69. 

Frequency polygon: construction of, 75-77; 
use of, 77-80. 


Gantt chart: 116. 

Geographic distributions (and series): see
Spatial.

Geometric mean: calculation of, 140-141;
definition of, 140; use of, 158, 160-161;
virtues of, in averaging of relatives, 347-348;
virtues of, in index-number construction,
347-348, 364, 366-367.

Gompertz curve: fitting of, 274; form of,
265-266.

Graded maps: construction of, 216-218;
development of comparable scales in, 226-228;
general nature of, 215-216.

Graphic representation: of classifications
by kind, 53-57; of classifications of degree,
59-61; of cumulative temporal distributions,
114-115; of cumulative frequency
distributions, 82-85; of rates of
change, 248-256; of seasonal variation,
300; of simple frequency distributions,
75-80; of simple temporal distributions,
109-113; of spatial distributions, 97-100,
101-102; of spatial series, 215-230;
of time series, 234-235; purposes of,
411.


Harmonic mean: bias of, in averaging of 
relatives, 346; calculation of, 142, 143; 
definition of, 141-142; use of, 161-162. 

Haskell, A. C.: 453. 

Helmle, W. C.: 88. 

Hettinger, A. J.: 204. 

Histogram: see Column diagram. 

Hollerith card: use of, in tabulation, 394- 
402. 

Homogeneity: of statistical data, 371-376. 

Hooker, R. H.: 317, 325. 

Huntington, E. V.: 192. 


“Ideal” index-number form: limitations of,
360-363; nature of, 358-359; virtues of,
359-360; substitutes for, 364.

Independent variable: 42-43. 

Index numbers: applications of, 330-334;
as ratios of averages, 348-349, 352-353;
base-reversal test of, 345-347; bias in
weighting of, 365-366; circular test of,
361, 362; as averages of relatives, 344-345;
definition of, 328-329; factor-reversal
test of, 359-360; function of, 329;
fundamental forms of, 342-344; general
nature of, 328-330; “ideal” form of, 358-360;
importance of purpose of, 334-336;
in form of chain indexes, 362-363; law
of proportionality in, 344-345; peculiarities
of data as affecting form of,
366-367; problems involved in construction
of, 334; problems of weighting in,
338-340, 342; questions of representativeness
in, 336-338, 350-351; relative
ease of computation of, 367; selection of
bases for, 340; selection of data for, 336-338;
substitutes for “ideal” form of, 364;
tests of validity in, 345, 359-360, 361, 362;
time-reversal test (see Base-reversal test);
type bias in, 345-347; virtues of geometric
mean in calculation of, 347-348; with fixed
weights, 353-355; with variable weights,
355-358.


Joint Committee on Standards for Graphic
Presentation: report of, 415-419.

Jones, D. C.: 453.

J-shaped distributions: 127-129.

Julin, A.: 316, 454. 


Karsten, K. G.: 453. 
Kaufmann, A.: 454. 
Kelley, T. L.: 200, 453. 
Kent, F. C.: 453.

Keynes, J. M.: 380-381.
King, W. I.: 176, 258, 454.
Knapp, J. G.: 375.


Lag: importance of, 326-327; measurement
of, 324-326; nature of, 322-324.

Line charts: standard form of, 414-417; use
of, 76-80, 111-113, 234-235, 248-257.

Lines of best fit: forms of, 263-266.

Link relatives: 289-296.

Logarithmic charts: construction of, 252-253;
interpretation of, 253-254; purpose
of, 249-251; virtues of, 257.

Logical consistency: of statistical data, 371. 

Lorenz curve: 85-89. 

Lorenz, M. O.: 85, 86. 


Magee, J. D.: 316, 317. 

March, L.: 316. 

Marshall, A.: 250. 

Marshall, W. C.: 454. 

Mayr, G. von: 454. 

Means: see Averages. 

Mean deviation: see Average deviation. 

Median: definition of, 142-144; determina- 
tion of, 144-147; freedom from bias in 
averaging of relatives, 347; use of, 158-159. 

Mills, F. C.: 209, 274, 454. 

Mitchell, W. C.: 328, 330, 348. 

Mode: definition of, 147; determination of, 
148-151; freedom from bias in averaging 
of relatives, 347; use of, 158. 

Moore, H. L.: 269. 

Morris, Ray: 373-374. 

Moving average: computation of, 261;
difficulties in use of, 261-263; nature of,
261.

Multiple correlation: 209. 


Ogive: see Frequency distributions (cumu- 
lative). 

Original items: symbols for, 135. 

Original observation: complete vs. partial, 
32-33; differentiation as to attribute in, 
28-30; differentiation as to location in, 
30-31; differentiation as to time in, 31; 
differentiation in, 28-31; limits of attri- 
bute in, 26-28; limits of space in, 27; 
limits of time in, 27; specification of 
limits in, 25-28. 


Pearl, Raymond: 8, 369, 370, 454. 

Pearson, Karl: 174, 193.

Pearsonian coefficient of correlation: calculation
of, 194-199, 318-322, 323; characteristics
of, 193-194; definition of, 193; nature
of, 193-194.

Percentiles: as measures of variability, 172; 
definition of, 172. 

Periodic movements: nature of, 238, 281-287. 

Persons, W. M.: 280, 280, 294, 311, 317, 318, 
362, 379-381. 

Piper, C. B.: 204.

Plotting: over intervals, or on lines, in time
charts, 111-112.

Polakov, W. N.: 116.

Popewell, F.: 299. 

Prescott, R. B.: 255, 266, 274. 

Probability: concept of, in interpretation of
statistical results, 379-381.

Probable error: 378-379.

Punch cards: use of, in tabulation, 394-402. 


Quartile deviation: calculation of, 171; 
definition of, 170; use of, 173-174. 

Quartiles: location of, 171, 172. 

Questionnaire method: 385-388. 

Random selection: 32-33. 

Rank coefficient of correlation: calculation
of, 206-209; definition of, 206.

Rates and ratios: as variables, 12-15; crude
vs. refined, 13.

Relevancy: of statistical results, 370-371.

Regression coefficient: 201.

Regression lines: determination of, 200-201;
nature of, 199-200. 

Residual movements: nature of, 302. 

Riegel, R.: 454. 

Rietz, H. L.: 454.

Ripley, W. Z.: 218. 

Rugg, H. O.: 70, 454. 


Sampling: 32-33, 376-377. 

Scatter diagram: construction of, 181-182; 
nature of, 181-182. 

Schedules: 386-388. 

Seasonal variation: 238-239; changes in,
298-300; elimination of, 300-301; graphic
representation of, 300; interpretation of
indexes of, 297-298, 301; measurement
of, 287-297; measurement of, by method
of link relatives, 289-296; nature of, 281.

Secondary data: accuracy and complete- 
ness of, 391; record of manner in which 
collected, 390-391; record of sources of, 
389-390; questions regarding significance 
of, 391. 

Secrist, H.: 370, 454. 

Secular trend: see Trend. 

Semi-logarithmic curves: see Logarithmic
charts.

Seriation: contrasted with classification,
46-47; nature of, 42.

Series: forms of, 43-46; nature of, 42-43.

Skewness: see Asymmetry.

Smith ek Cas 287,

Smoothing of curve: 77, 307.

Snyder, C.: 333.

Spatial distributions: class areas in, 93-95;
class limits and designations in, 95;
graphic representation of, 97-100, 101-102;
importance of, 100, 103; linear vs. areal
forms of, 92-93; nature of, 90-92; tabular
presentation of, 95-97.

Spatial series: application of correlation
analysis to, 230; formulation of, 212-213; 
graphic comparison of, 218-230; linear 
forms of, 214; nature of, 211-212. 

Stability: of statistical data, 377. 

Standard deviation: computation of, 168- 
169; definition of, 167-168; presumption 
in favor of use of, 173. 

Statistical analysis: contrasted with collec- 
tion and tabulation, 8-9. 

Statistical charts: standardization of, 413-420.

Statistical data: accuracy of, 377-378; 
adequacy of, 376-377; considerations 
involved in interpretation of, 370; dis- 
tinguishing feature of, 368; homogeneity 
of, 371-376; logical consistency of, 371; 
relevancy of, 370-371; stability of, 377. 

Statistical distributions: see Distributions. 

Statistical method: contrasted with experimental,
368-370.

Statistical results: general nature of, 381. 

Statistical tables: standardization of, 403-410;
general-purpose vs. special-purpose,
404-405.

Statistics: contributions of, 4-5; definition
of, 1-2; increasing influence of, 6-8;
limits of, 5-6; objectives of, 6; three
phases of, 8-9; value of, 2-3.

Straight line: fitting of, to empirical data,
268-274.

Symmetry: see Asymmetry. 


Table construction: arrangement of columns 
and rows in, 405-410; commonly recog- 
nized rules for, 404; standardization of, 
403, 410. 

Tabulation: by machine processes, 394-402;
by sorting and counting of returns, 393;
through use of tally sheet, 393-394; use
of codes in, 396-401; use of punch-cards
in, 394-402.

Temporal distributions (cumulative): formulation
of, 113-114; graphic representation
of, 114-115; use of, in following business
factors, 116-117.

Temporal distributions (simple): class intervals
in, 105-107; class limits and designations
in, 107-108; graphic representation
of, 109-113; nature of, 104.

Thorndike, E. L.: 454. 

Thurston, L. L.: 199. 

Time-reversal test: 345-347. 

Time series: combinations of movement
in, 242-245; formulation of, 233-234;
graphic comparison of, 313-315; increments
of change in, 246-247; lead and
lag in, 322-327; nature of, 231-245;
rates of change in, 246-248; types of
movement in, 235-237.


Trend: determination of, by line of best fit,
263-274; determination of form of, 263-266;
determination of, by method of freehand
drawing, 260; determination of, by
method of moving average, 263; elimination
of, 274-280; methods of determining,
259-260; nature of, 238, 258-259; selection
of period of, 267-268.


Units: abstract units of measurement, 22-23;
compound units of measurement, 22;
institutional objects as, 19-20; natural
objects as, 18; of measurement, 21-23;
pecuniary units of measurement, 21-22;
physical units of measurement, 21; produced
objects as, 18-19; qualified objects
as, 20; the individual case as a unit, 17-20.

U-shaped distributions: 118.


Variability: coefficients of, 174-175; common
measures of, 164-172; relative merits
of measures of, 173-175; significance of,
163-164; symbols for common measures
of, 165.

Variables: conditions under which emerge,
23-24; continuous and discrete, 15-17;
derived and compound, 12-15; from computations,
11; from measurements, 11;
from tabulations, 11; general nature of,
10; independent and dependent, 42-43;
rates and relatives as, 12-15; simple
averages as, 11-12.

Venn, J.: 155.


Walsh, C. M.: 328, 350.

Watkins, G. P.: 17, 403.

Weighted averages: calculation of, 152;
nature of, 151-152.

Weighting: importance of, in index-number
construction, 338-340; importance of, in
averaging of rates and ratios, 152.

Weld, L. D.: 454.

Wenzel, J.: 254.

West, C. J.: 110, 454.

Wright, P. G.: 306.


Young, A. A.: 328, 359.

Young, B. F.: 454.

Yule, G. U.: 2,501 70,74, Tio te4y rsa 155s
173, 199, 209, 318, 368, 454.


Zero: inclusion of, in vertical scale of charts,
111.

Zizek, F.: 144, 371, 454.